Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill in by coding along with the instructor in R. Other teaching materials include an HTML version of the notebook and datasets to import into R when required. This learning approach lets you spend your time coding, not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available through DataCamp (see the syllabus for the grading scheme and percentages of the final mark) to help cement and/or extend what you learn each week.
We'll take a blank-slate approach to R and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), subset and merge datasets, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which can really save you time and help scale up your analyses.
The structure of the class is a code-along style: It is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.
This is the sixth in a series of seven lectures. Last lecture we concluded our journey through data wrangling in the world of regular expressions. This week we will explore the "clean" data that we've learned to generate using the world of statistical models. At the end of this session you will be able to perform simple and multiple linear regression, one- and multi-way analysis of variance (ANOVA), and analysis of covariance (ANCOVA). You will be able to interpret the statistics that come out of these models, be cognizant of the assumptions they make, and use the F-test and Akaike Information Criterion (AIC) to select the best model for the job.
Grey background: command-line code, R library and function names... fill in the code here if you are coding along.

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run them independently of JupyterHub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.
The dataset we will use for this lesson is from the Summer Institute in Statistical Genetics (SISG) at the University of Washington's course in Regression and Analysis of Variance by Lurdes Inoue. This lesson uses a lot of material from the SISG 2016 course as well as conceptual material from Ben Bolker. I like this dataset because it has a number of categorical and continuous variables, which allows us to use the same dataset for many models. Also, the variables are familiar (age, BMI, sex, cholesterol), which makes data interpretation easier while we are in the learning stage.
We'll read the data in and take a look at the structure.
A space-separated file consisting of measurements related to lipid health and haplotype data for variants at loci that may influence these values. We'll be using this dataset to explore and model relationships between genotype and phenotype.
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2). In particular, we will again lean on dplyr and ggplot2 this week.
car, a companion to applied regression, which has helpful tools for the analysis of variance
gee a generalized estimation equation solver for fitting your data
multcomp general linear hypothesis testing in parametric models
broom useful for formatting model output from linear models
Some of these packages should already be installed in your Anaconda base environment from previous lectures. If not, please review that lesson and install the missing packages. Remember to install these packages from the conda-forge channel of Anaconda.
conda install -c conda-forge r-biocmanager
BiocManager::install("limma")
conda install -c conda-forge r-gee
conda install -c conda-forge r-multcomp
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
# The packages below are not already available on JupyterHub
# install.packages("car", dependencies = TRUE) # ~6 minutes to install all dependencies
# install.packages("gee", dependencies = TRUE) # ~ 3 minutes to install all dependencies
#--------- Load packages for today's session ----------#
library(tidyverse)
library(car)
library(multcomp) # used for glht()
library(gee) # Generalized Estimation Equation Solver for our Appendix section
library(broom)
Warning message: "package 'tidyverse' was built under R version 4.0.5" -- Attaching packages --------------------------------------- tidyverse 1.3.1 -- v ggplot2 3.3.3 v purrr 0.3.4 v tibble 3.1.1 v dplyr 1.0.6 v tidyr 1.1.3 v stringr 1.4.0 v readr 1.4.0 v forcats 0.5.1 Warning message: "package 'ggplot2' was built under R version 4.0.5" Warning message: "package 'tibble' was built under R version 4.0.5" Warning message: "package 'tidyr' was built under R version 4.0.5" Warning message: "package 'dplyr' was built under R version 4.0.5" Warning message: "package 'forcats' was built under R version 4.0.5" -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() Warning message: "package 'car' was built under R version 4.0.5" Loading required package: carData Attaching package: 'car' The following object is masked from 'package:dplyr': recode The following object is masked from 'package:purrr': some Loading required package: mvtnorm Loading required package: survival Loading required package: TH.data Loading required package: MASS Attaching package: 'MASS' The following object is masked from 'package:dplyr': select Attaching package: 'TH.data' The following object is masked from 'package:MASS': geyser Warning message: "package 'gee' was built under R version 4.0.5" Warning message: "package 'broom' was built under R version 4.0.5"
I am not a statistician and likely have just enough knowledge to realize that I mostly know nothing about statistics. The tools and methods we discuss today are to familiarize you with the concept of creating simple linear models but you should always think deeply about your data and approach it with the right statistical toolset.
Before embarking on your journey, ask if your data meets all the criteria for this type of analysis. Read papers on similar subjects and see what the consensus is!
Lastly, don't get trapped on Mount Stupid of the Dunning-Kruger curve! We're aiming for the Valley! To despair... and beyond!
In order to work with our data we need to be able to answer some basic questions.
These are all really important questions that we may or may not think about as we try to dive in and get our answer as quickly as possible. Today we are going to slow down a bit and think about our data and our models.
Let's load our dataset today with a new function, read.delim(), where we will specify that the file is space-separated with a header.
# Load our data with a new function: read.delim
# sep and header tell read.delim how the file is laid out (space-separated, first row = column names)
cholesterol <- read.delim("data/SISG-Data-cholesterol.txt", sep = " ", header=TRUE)
# Check the structure of our data set
str(cholesterol)
# What does it look like
head(cholesterol)
'data.frame': 400 obs. of 9 variables: $ ID : int 1 2 3 4 5 6 7 8 9 10 ... $ sex : int 1 1 0 0 1 1 0 0 0 0 ... $ age : int 74 51 64 34 52 39 79 38 52 58 ... $ chol : int 215 204 205 182 175 176 159 169 175 189 ... $ BMI : num 26.2 24.7 24.2 23.8 34.1 22.7 22.9 24.9 20.4 22 ... $ TG : int 367 150 213 111 328 53 274 137 125 209 ... $ rs174548 : int 1 2 0 1 0 0 2 1 0 0 ... $ rs4775401: int 2 1 1 1 0 2 1 1 1 1 ... $ APOE : int 4 4 4 1 1 4 1 1 4 5 ...
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE | |
|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <dbl> | <int> | <int> | <int> | <int> | |
| 1 | 1 | 1 | 74 | 215 | 26.2 | 367 | 1 | 2 | 4 |
| 2 | 2 | 1 | 51 | 204 | 24.7 | 150 | 2 | 1 | 4 |
| 3 | 3 | 0 | 64 | 205 | 24.2 | 213 | 0 | 1 | 4 |
| 4 | 4 | 0 | 34 | 182 | 23.8 | 111 | 1 | 1 | 1 |
| 5 | 5 | 1 | 52 | 175 | 34.1 | 328 | 0 | 0 | 1 |
| 6 | 6 | 1 | 39 | 176 | 22.7 | 53 | 0 | 2 | 4 |
This dataset maps genetic variants (single nucleotide polymorphisms, or SNPs) and their relationship to cholesterol (chol) and triglycerides (TG) at three loci: rs174548 (in FADS1, an enzyme involved in fatty acid desaturation), rs4775401 (a candidate SNP), and APOE (a major apolipoprotein important in Alzheimer's disease).
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE |
|---|---|---|---|---|---|---|---|---|
| patient code | Male = 0, Female = 1 | patient age | cholesterol | body mass index | triglycerides | haplotype: C/C = 0, C/G = 1, G/G = 2 | haplotype: C/C = 0, C/T = 1, T/T = 2 | allele haplotype: e2/e2 = 1, e2/e3 = 2, e2/e4 = 3, e3/e3 = 4, e3/e4 = 5, e4/e4 = 6 |
We are ultimately interested in the relationship between the above genetic variants and cholesterol, while controlling for factors such as age and sex. But let's get our feet wet by starting with the easier question: is there an association between mean serum cholesterol and age?
For this question, cholesterol is the dependent variable (the variable being measured). Age is the independent variable that we use to explain the variation in cholesterol.
It is always, always, always, a good idea to make an 'exploratory' plot of your data and get an idea of what its distribution looks like. We can start with a simple scatter plot of age and cholesterol.
# Plot age vs cholesterol as a scatterplot
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = age, y = chol) +
# 4. Geoms
geom_point()
Our data above looks pretty spread out along these axes. Let's generate some summary statistics to explore it further. We can describe our data with some basic summary statistics such as the mean, median, mode, min, max, standard deviation, and variance (central tendency and dispersion measures). R does not have a mode function relating to statistics, but we can figure it out for ourselves.
# Revisit the structure of our data
str(cholesterol)
# Pipe our dataset over to the summarise function (mean, median, min, max, sd, variance)
cholesterol %>%
summarise(mean_chol = mean(chol),
median_chol = median(chol),
min_chol = min(chol),
max_chol = max(chol),
sd_chol = sd(chol),
variance_chol = var(chol))
# We need to calculate mode on our own by sorting
mode <- sort(table(cholesterol$chol), decreasing = TRUE)
# We can access table header information using the names() function
cat("Most frequent chol value is:", names(mode[1]), "with", mode[1], "occurrences") # most frequent value in our dataset
'data.frame': 400 obs. of 9 variables: $ ID : int 1 2 3 4 5 6 7 8 9 10 ... $ sex : int 1 1 0 0 1 1 0 0 0 0 ... $ age : int 74 51 64 34 52 39 79 38 52 58 ... $ chol : int 215 204 205 182 175 176 159 169 175 189 ... $ BMI : num 26.2 24.7 24.2 23.8 34.1 22.7 22.9 24.9 20.4 22 ... $ TG : int 367 150 213 111 328 53 274 137 125 209 ... $ rs174548 : int 1 2 0 1 0 0 2 1 0 0 ... $ rs4775401: int 2 1 1 1 0 2 1 1 1 1 ... $ APOE : int 4 4 4 1 1 4 1 1 4 5 ...
| mean_chol | median_chol | min_chol | max_chol | sd_chol | variance_chol |
|---|---|---|---|---|---|
| <dbl> | <dbl> | <int> | <int> | <dbl> | <dbl> |
| 183.915 | 184 | 117 | 247 | 22.11777 | 489.1958 |
Most frequent chol value is: 191 with 12 occurrences
geom_density() to visualize your sample distribution

Our mean, median, and mode are not that different, so our data is not skewed in either direction. We can also confirm this by making a quick density plot of the deviations from the mean (each data point minus the mean).
# make a density plot around the mean cholesterol value
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(chol - mean(chol)) + # Subtract the mean from each value to centre the distribution around 0
# 4. Geoms
geom_density() # Use a kernel density estimate
quantile() to gauge the range of your distribution

Another way to dissect our distribution is to examine the values at the boundaries when we break the data into portions. We can do this with the quantile() function, specifying the proportions we want with the probs argument.
Using the default behaviour of quantile() generates quartiles of our data which will also give us a good sense of the range and distribution of our data.
# Use the default quantile parameters
quantile(cholesterol$chol)
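The probs argument lets us go beyond quartiles. A minimal sketch using simulated values (kept self-contained here; with our data you would pass cholesterol$chol instead):

```r
# quantile() with custom probabilities: deciles (10% steps)
set.seed(42)
x <- rnorm(400, mean = 184, sd = 22)          # roughly matches our chol summary stats
deciles <- quantile(x, probs = seq(0, 1, by = 0.1))
deciles                                       # 11 cut points, from 0% to 100%
```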
Going back to our question regarding if age is related to cholesterol, let's add the mean cholesterol to our plot for reference. This is done by adding geom_hline() to our plot and specifying the value for the yintercept.
# Modify our scatterplot to include a mean cholesterol value
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = age, y = chol) +
# 4. Geoms
geom_point() +
geom_hline(yintercept = mean(cholesterol$chol), color = "red")
From our plot, it looks like the mean might increase with age, but how do we test this?
T-tests are a simple statistical tool to compare pairs of means, i.e. between two groups at a time. We don't currently have age groups, but we can make them. One way to do this is to use our dplyr skills to create a new column, 'age_group'. The data can be split at 55 years old (the midpoint of age in our data).
We can use an if/else statement (the ifelse function) to test: is age greater than 55? If the answer is 'yes', the value is 1, and if the answer is 'no', the value is 0. We can take a quick look at our dataset to make sure this worked.
Remember mutate(data, new_col = criteria_or_calculation)
# Add an age_group variable to our cholesterol dataset
cholesterol <- cholesterol %>%
  mutate(age_group = ifelse(test = age > 55, yes = 1, no = 0))
# Check our updated dataset
str(cholesterol)
'data.frame': 400 obs. of 10 variables: $ ID : int 1 2 3 4 5 6 7 8 9 10 ... $ sex : int 1 1 0 0 1 1 0 0 0 0 ... $ age : int 74 51 64 34 52 39 79 38 52 58 ... $ chol : int 215 204 205 182 175 176 159 169 175 189 ... $ BMI : num 26.2 24.7 24.2 23.8 34.1 22.7 22.9 24.9 20.4 22 ... $ TG : int 367 150 213 111 328 53 274 137 125 209 ... $ rs174548 : int 1 2 0 1 0 0 2 1 0 0 ... $ rs4775401: int 2 1 1 1 0 2 1 1 1 1 ... $ APOE : int 4 4 4 1 1 4 1 1 4 5 ... $ age_group: num 1 0 1 0 0 0 1 0 0 1 ...
A t-test is a fairly straightforward yet powerful test to determine whether two means are statistically (significantly) different. However, there are some assumptions (criteria) that our data needs to meet in order to apply a t-test: the observations are independent, the samples come from normally distributed populations, and the groups have equal variances.

Before applying a t-test, let's make sure our data comes from a normally distributed population. The density plot we generated above suggests the values are part of a normal distribution, but let's check with some other tests.
This is a common test for normality that essentially compares your data to a normal distribution and provides a p-value for the likelihood of the null hypothesis. The null hypothesis (H0) is that the samples came from a normal distribution.

The Shapiro-Wilk test requires sample sizes between 3 and 5,000. If your p-value is $\leq$ 0.05, you would reject the null hypothesis (H0), suggesting your data is not from a normal distribution.
# Run the Shapiro-Wilk test on both age groups
shapiro.test(cholesterol$chol[cholesterol$age_group == 0]) # Younger than 55 years
shapiro.test(cholesterol$chol[cholesterol$age_group == 1]) # Older than 55 years
# Normality test for all cholesterol levels, regardless of the age group
# H0 or null hypothesis: samples come from a population that follows a normal distribution
# H1: samples come from a population that does not follow a normal distribution (the opposite of H0).
# In hypothesis testing, you're not trying to prove H1 but to reject H0.
# What happens when we take the log of our data?
shapiro.test(log(cholesterol$chol))
Shapiro-Wilk normality test data: cholesterol$chol[cholesterol$age_group == 0] W = 0.99489, p-value = 0.7306
Shapiro-Wilk normality test data: cholesterol$chol[cholesterol$age_group == 1] W = 0.99166, p-value = 0.311
Shapiro-Wilk normality test data: log(cholesterol$chol) W = 0.98982, p-value = 0.007116
As we mentioned in our ggplot class in lecture 04, we can generate a quick histogram to look at how our data is roughly distributed. This is similar to a density plot, but uses the values of our data binned into groups. It may, however, look quite familiar to you already.
# histogram 2 ways
# Using base R
hist(cholesterol$chol)
# Using ggplot, let's include a kernel density too
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(chol) +
# 4. Geoms
geom_histogram(binwidth=10, alpha = 0.3) +
geom_density(aes(y=10*after_stat(count))) # Use a kernel density estimate
What is a quantile-quantile plot (Q-Q plot)?
It's a scatter plot. More specifically, it's a scatter plot of your values (sorted in ascending order) against a sample size-matched number of quantiles from a chosen theoretical distribution. There are multiple ways to generate a Q-Q plot in R but here we will use:
qqnorm() to check our data against a normal distribution, and qqline() to draw a line going through the first and third quartiles.

What do we want to see?
Essentially a straight line. Depending on the shape and tails, you may need to transform your data to match a normal distribution. If that doesn't seem possible, then you really should not continue with a parametric test like the t-test.
# qqplot with a qqline
qqnorm(cholesterol$chol); qqline(cholesterol$chol)
# log transform your data
qqnorm(log(cholesterol$chol)); qqline(log(cholesterol$chol))
Also written as homoskedasticity, this assumption relates to the variance within each group. If we generate a regression model, is the variance of the residuals (the error term) constant? If instead the error term differs significantly across the independent variable (i.e., age_group), then the model is missing a large effect from another independent variable.
We'll use Bartlett's test with bartlett.test() which essentially tests if k groups have equal variances and returns a K-squared value and p-value against our null hypothesis (H0 = the groups have the same variance). The call we will use takes the form:
bartlett.test(formula = values ~ grouping, data = dataset)
where values ~ grouping means to use the data from the variable values and to group it by grouping when retrieving from dataset.
Afterwards, we will compare the output of our test against a $\chi^{2}$ value with probability $(1-\alpha)$ and $df = k-1$ with qchisq()
A quick warning: this test assumes normality for your samples - if this is not the case, a rejection of H0 may be a rejection of normality instead!
## Bartlett's H0: There is homoskedasticity between samples cholesterol of age groups
bartlett.test(chol ~ age_group, data=cholesterol)
#equivalent to
bartlett.test(cholesterol$chol, cholesterol$age_group)
# If Chi-squared > Bartlett's K-squared, the data are homoskedastic (we do not reject H0)
qchisq(p = 0.950, df = 1)
Bartlett test of homogeneity of variances data: chol by age_group Bartlett's K-squared = 2.7431, df = 1, p-value = 0.09767
Bartlett test of homogeneity of variances data: cholesterol$chol and cholesterol$age_group Bartlett's K-squared = 2.7431, df = 1, p-value = 0.09767
In our case, from the testing above, the data follow a normal distribution, and although the variances of the groups are not exactly identical, the data are still homoskedastic! We can now use a boxplot to look at the distribution of cholesterol for our 2 groups.
Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend from the 1st and 3rd quartiles to the closest values that are no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = factor(age_group), y = chol) +
xlab("age") +
ylab("cholesterol (mg/dl)") +
# 3. Scaling
scale_x_discrete(labels = c("30-55", "56-80")) +
# 4. Geoms
geom_boxplot() +
geom_jitter()
There seems to be a lot of overlap in our cholesterol values. How do we tell if the means are truly different?
Let's think about this a little more explicitly:
The null hypothesis (H0) is that there is no difference in the sample means between our groups.
An alternative hypothesis (H1) is that there is a difference between the means (2-sided test), or that the difference in means is greater or less than zero (1-sided test).

$\alpha$ is our significance level, $\mu$ is the population mean, and $\bar{x}$ is our sample mean. Remember that we are estimating the true population mean using the sample that we have. Our p-value is the probability of observing a result at least as extreme as ours by chance, given that the null hypothesis is true.
We will use a simple Student's t-test to test the alternative hypothesis that the true difference in means is not equal to 0.
The t.test function takes as input the variables on which we are performing the test (as vectors), the type of t-test being performed ("two.sided", "less", or "greater"), and the confidence level. The confidence interval is the interval that will cover the true parameter x% of the time. In the image above, the confidence interval covers the pink area, (1-$\alpha$).
You can alternatively enter your variables as a formula, in this case y ~ x. The ~ in R is used to separate the left and right sides of a formula. (You can run a 1-sided t-test by specifying alternative = 'greater' or alternative = 'less'.) In this case, alternative = 'two.sided' and conf.level = 0.95 are the default parameters and are only included for clarity. For now we are assuming equal variance, based on our previous tests.
#?t.test
# Provide t.test with the two groups of data to compare
t.test(x = cholesterol$chol[cholesterol$age_group == 0], # 55 years or younger
y = cholesterol$chol[cholesterol$age_group == 1], # 56 years or older
alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)
Two Sample t-test data: cholesterol$chol[cholesterol$age_group == 0] and cholesterol$chol[cholesterol$age_group == 1] t = -3.6349, df = 398, p-value = 0.0003146 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -12.202574 -3.636122 sample estimates: mean of x mean of y 179.9751 187.8945
#is equivalent to
t.test(formula = chol ~ age_group, data = cholesterol, var.equal = TRUE)
Two Sample t-test
data: chol by age_group
t = -3.6349, df = 398, p-value = 0.0003146
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-12.202574 -3.636122
sample estimates:
mean in group 0 mean in group 1
179.9751 187.8945
Our output tells us the mean cholesterol for those aged 30-55 is 179.98 mg/dl and the mean for those aged 56-80 is 187.89 mg/dl. The difference in means is significant at a p-value of 0.0003146 (< 0.05).
So we now know there is a positive relationship between cholesterol and age. However the t-test has limitations. What is the magnitude of this relationship during aging? Does it change by approximately the same amount per year? What if we don't want to break our data into groups?
For the answer to this, we'll need to learn a new toolset.
Source: "All (or most) of statistics". Bolker, 2007.
There are a ton of models (or families of models) out there for different statistical purposes and with different assumptions. These assumptions, if violated, can give incorrect predictions. However, we might not know if these assumptions are true when selecting our model. Today we are hanging out in the top left corner, and we are going to learn the assumptions of linear models in general, the specific models we will be using today, and an example of each. We will troubleshoot cases where the assumptions fail later in the lesson.
1. Observed values are independent of each other (independence)
2. Variation around expected values (residuals) is normally distributed (normality)
3. Constant variance, i.e. homoscedastic (equal variance)
4. Observed values (y) are related to x by linear functions of the parameters (linearity)

Assumptions 2 and 3 are often grouped together.

(James et al. (2017). An Introduction to Statistical Learning)
For simple linear regression we are modeling a continuous outcome by a single continuous variable. Example: modeling cholesterol using BMI.
For multiple linear regression we are modeling a continuous outcome by more than one continuous variable. Example: modeling cholesterol using BMI and age. In this case, we must consider whether there is an interaction between age and BMI on cholesterol (more on interactions to follow).
For one-way ANOVA we are modeling a continuous outcome by a single categorical variable. Example: modeling cholesterol by sex. It is important that categorical variables are explicitly input as factors to be interpreted properly in the model. For example, since we have encoded sex as 0 and 1 (instead of 'M' and 'F'), we need to specify that sex is to be treated as categorical and not as a number. Therefore we specify sex as a factor with 2 levels, 0 and 1.
For multi-way ANOVA we are modeling a continuous outcome by more than one categorical variable. Example: modeling cholesterol by sex and APOE genetic variants. Again, we need to consider any interaction between our categorical variables, and we need to specify that our numeric codes are to be treated as categorical variables and not numbers. APOE will be a factor with 6 levels, one for each genetic variant.
Lastly, for ANCOVA we are modeling a continuous variable by a combination of categorical AND continuous variables. Example: modeling cholesterol using the genetic variants of APOE and BMI. Again, our categorical variable must be input as a factor. ANCOVA allows for each group (each genetic variant of APOE in this example) to have a separate slope.
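As a quick sketch of the factor conversion mentioned above (the toy values below follow the coding table for sex in our data dictionary, 0 = male and 1 = female):

```r
# Sketch: telling R that numeric codes are categories, not numbers
sex <- c(1, 1, 0, 0, 1)                             # toy values using our 0/1 coding
sex_f <- factor(sex, levels = c(0, 1), labels = c("M", "F"))
levels(sex_f)                                       # two levels: "M" "F"
# You can also convert inline in a model formula, e.g. chol ~ factor(sex)
```

The same idea applies to APOE, which would become a factor with 6 levels.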
This is a summary table you might find helpful for choosing a model based on the data types you have and the assumptions you are making. I hope to show that model selection is akin to going through a mental checklist for your data, and is not that scary. The independence assumption is required for all the models below and is not included in the chart for spacing reasons.
| model | categorical | continuous | linearity | normality | homoscedasticity |
|---|---|---|---|---|---|
| simple linear regression | X | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| multiple linear regression | X | $\checkmark\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| one-way analysis of variance (ANOVA) | $\checkmark$ | X | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| multi-way ANOVA | $\checkmark\checkmark$ | X | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| analysis of covariance (ANCOVA) | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| nonlinear least squares | X | $\checkmark$ | X | $\checkmark$ | $\checkmark$ |
| nonlinear ANCOVA | $\checkmark$ | $\checkmark$ | X | $\checkmark$ | $\checkmark$ |
| generalized linear models | $\checkmark$ | $\checkmark$ | X* | X | X |
*restricted cases
What is the relationship between cholesterol and age?
Now, we can pick a model to answer our question by considering the assumptions above instead of a t-test.
If we evaluate our independent and dependent variables, age and cholesterol, they are both continuous, not categorical. We only have one independent variable. From the plot we made earlier (repeated below) it looks like if there is a relationship between age and cholesterol it would be linear.
# Show a scatterplot again of age vs cholestorol
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = age, y = chol) +
# 4. Geoms
geom_point()
Based on the above criteria, we will try using a simple linear regression to test the association between mean serum cholesterol and age.
ggplot()

What we are looking for is the slope of the line relating cholesterol to age, which tells us the magnitude and direction of the relationship between these variables. We can look at the slope of the linear model that ggplot() would fit for us to get an idea of what our own model will look like.
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = age, y = chol) +
# 4. Geoms
geom_point() +
stat_smooth(method = "lm") # Add a stat_smooth as we've seen in the past
`geom_smooth()` using formula 'y ~ x'
To make sense of what we're seeing on our plot, let's review the equation for a straight line:
\begin{equation*}\Large Y \sim Normal(a + bx, {\sigma^2}) \end{equation*}

y is our dependent variable that we are attempting to model.
x is our independent variable.
a is the intercept (the value of y where x = 0; where x crosses the y-axis).
b is the slope of the line (the change in y corresponding to a unit increase in x).
Normal is telling us that our error is normally distributed.
$\sigma^2$ is the variance of the normally distributed error (the spread of y around the fitted line).
Slopes
The interpretation in our example is that the slope is the difference in mean serum cholesterol associated with a one year increase in age.
With a straight line we are not, of course, plotting through all of our points, but rather close to the mean of an outcome in y as a function of x. For example, there are only six values of cholesterol for the 50 year-old age group, and our line will fall somewhere close to the mean of these values. Values of y have a distribution at a given x, which we have assumed is normally distributed.
Lastly, in this equation we also have some normally distributed error. Sampling error exists in our estimates because different samples give different means.
How do we actually find the best fitting line? We use least squares estimation, which minimizes the sum of squares of the vertical distances from the observed points to the least squares regression line ($y -\hat{y}$) .
$y$ - observed value
$\hat{y}$ - estimated value
$\bar{y}$ - sample mean
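As a sanity check, the closed-form least squares estimates can be computed by hand and compared against lm(). This is a minimal sketch on a small made-up dataset (x and y below are invented, not the cholesterol data):

```r
# Toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form least squares: slope from the co-variation of x and y,
# intercept from the requirement that the line passes through the means
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)
coef(lm(y ~ x))  # lm() returns the same estimates
```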
Let's run this simple linear regression. Using R, the intercept and slope terms are implicit.
R code:
lm(y ~ x)
There are times when the intercept (the value of y at x = 0) doesn't make much intuitive sense for interpreting our data. To force the intercept to zero (y = bx) and make relative comparisons instead, use lm(y ~ x - 1).
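For instance, with a small toy dataset (invented here purely for illustration; not the cholesterol data), we can see the effect of dropping the intercept:

```r
# Toy data for illustration only: y is exactly 3 times x
x <- 1:10
y <- 3 * x

coef(lm(y ~ x))      # intercept and slope are both estimated
coef(lm(y ~ x - 1))  # the -1 drops the intercept; the line is forced through the origin
```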
As we are used to writing equations, our dependent variable (cholesterol) is on the left side of the lm formula and our independent variable (age) is on the right side; the tilde ~ separates these sides. We also pass the dataset to the lm function.
# Generating a linear model. Looks a lot like the bartlett.test() format right?
lm(formula = chol ~ age, data = cholesterol)
Call:
lm(formula = chol ~ age, data = cholesterol)

Coefficients:
(Intercept)          age
   166.9017       0.3103
The function will output our formula, the slope, and the intercept. However, if we save the output of the function into an object, ageFit, we get a list object containing the model, the input, and all associated statistics. We can look at a summary() and get residuals, errors, p-values, and more, in addition to our coefficients.
# Assign the linear model to an object
ageFit <- lm(formula = chol ~ age, data = cholesterol)
# ageFit is a list of 12 elements that can be individually accessed by using the dollar operator
str(ageFit)
List of 12
 $ coefficients : Named num [1:2] 166.9 0.31
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "age"
 $ residuals    : Named num [1:400] 25.13 21.27 18.24 4.55 -8.04 ...
  ..- attr(*, "names")= chr [1:400] "1" "2" "3" "4" ...
 $ effects      : Named num [1:400] -3678.3 89.45 17.61 1.86 -9.49 ...
  ..- attr(*, "names")= chr [1:400] "(Intercept)" "age" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:400] 190 183 187 177 183 ...
  ..- attr(*, "names")= chr [1:400] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:400, 1:2] -20 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:400] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "age"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.05 1.02
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 398
 $ xlevels      : Named list()
 $ call         : language lm(formula = chol ~ age, data = cholesterol)
 $ terms        :Classes 'terms', 'formula'  language chol ~ age
  .. ..- attr(*, "variables")= language list(chol, age)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "chol" "age"
  .. .. .. ..$ : chr "age"
  .. ..- attr(*, "term.labels")= chr "age"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. ..- attr(*, "predvars")= language list(chol, age)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "chol" "age"
 $ model        :'data.frame': 400 obs. of 2 variables:
  ..$ chol: int [1:400] 215 204 205 182 175 176 159 169 175 189 ...
  ..$ age : int [1:400] 74 51 64 34 52 39 79 38 52 58 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language chol ~ age
  .. .. ..- attr(*, "variables")= language list(chol, age)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "chol" "age"
  .. .. .. .. ..$ : chr "age"
  .. .. ..- attr(*, "term.labels")= chr "age"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. .. ..- attr(*, "predvars")= language list(chol, age)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "chol" "age"
 - attr(*, "class")= chr "lm"
summary() to look at the salient information about your model

As you can see from looking at the structure, the lm model object holds a lot of information. We can use the summary() function to pull out some of the more important information about our model instead of wading through all of the data. Note that this information can vary between different types of objects.
# Look at the summary information of ageFit
summary(ageFit)
Call:
lm(formula = chol ~ age, data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-60.453 -14.643 -0.022 14.659 58.995
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 166.90168 4.26488 39.134 < 2e-16 ***
age 0.31033 0.07524 4.125 4.52e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.69 on 398 degrees of freedom
Multiple R-squared: 0.04099, Adjusted R-squared: 0.03858
F-statistic: 17.01 on 1 and 398 DF, p-value: 4.522e-05
The intercept is 166.9 and the slope is 0.31. What does that actually mean? It means a newborn (age 0) would be expected to have, on average, a serum cholesterol of 166.9 mg/dl. For every one-year increase in age, mean serum cholesterol is expected to increase by 0.31 mg/dl. These results are significant with a p-value < 0.001. We can reject the null hypothesis and say that mean serum cholesterol is significantly higher in older individuals. The Adjusted R-squared value tells us that about 4% (0.03858) of the variability in cholesterol is explained by age.
Note there are two sets of p-values presented: Pr(>|t|), and a p-value associated with the F-statistic.
Pr(>|t|) is an evaluation of the coefficient producing this t-statistic against a null hypothesis in which the true coefficient is 0. In other words, looking at the intercept of 166.90, it is the proportion of the t-distribution at df = 398 that is greater than the absolute value of your t-statistic (39.134).
The p-value associated with the F-statistic tells us about the model overall and whether it predicts better than just random noise. We will discuss F-values a little more in a few sections.
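To connect the two, a Pr(>|t|) entry can be reproduced by hand with pt(). This is just a sketch, plugging in the rounded t value and residual degrees of freedom for the age coefficient reported by summary(ageFit) above:

```r
# Two-sided p-value for the age coefficient: t = 4.125 on 398 residual df.
# The area in both tails of the t-distribution beyond |t| gives Pr(>|t|).
2 * pt(4.125, df = 398, lower.tail = FALSE)
```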
confint() to generate confidence intervals on your data

Given that we have a model object like our ageFit linear model, we can generate a confidence interval using the confint() method. By default this function sets the argument level to 0.95, a 95% confidence interval. Let's generate a confidence interval for our model parameters.
# Generate a confidence interval with confint()
confint(ageFit)
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 158.5171656 | 175.2861949 |
| age | 0.1624211 | 0.4582481 |
We can further get confidence intervals for these values to say that 95% of the time we expect the cholesterol of a baby to fall within 158.5-175.3 mg/dl, or that we are 95% confident that the difference in mean cholesterol associated with a one year increase in age is between 0.16 and 0.46 mg/dl.
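If we want a stricter interval, the level argument can be changed. For example, a 99% confidence interval for the same model (this assumes ageFit from above is still in your environment):

```r
# Wider interval: 99% confidence level instead of the default 0.95
confint(ageFit, level = 0.99)
```

A 99% interval is always at least as wide as the 95% interval for the same estimate, since we demand more confidence of capturing the true parameter.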
In multiple linear regression we use multiple continuous independent variables to predict outcome values. Additional terms can be added in 2 ways.
It was mentioned before that the 'linear' part of linear regression is the linear function of the parameters and not the independent variables. In the example below, the parameters $a$, $b_1$, and $b_2$ are linear even though the independent variable has a quadratic component, $x^2$. An example of this could be synthesizing a chemical, where with increasing temperature, synthesis progresses with an increasing curve.
Expression:
\begin{equation*}\Large Y \sim Normal(a + b_1x + b_2x^2, {\sigma^2}) \end{equation*}
If we were to write this in R, again our intercept and coefficients are implicit. To write the quadratic term we use the function I(), which means 'as.is'. This allows the interpreter to see the ^ as a power symbol and not as an interaction term.
R code:
lm(y ~ x + I(x^2))
Suppose from our data that we think cholesterol can be modeled by age as a quadratic equation. What would that look like?
# summarize the results of a multiple linear regression
summary(lm(chol ~ age + I(age^2), data = cholesterol))
Call:
lm(formula = chol ~ age + I(age^2), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-56.836 -15.063 -0.654 14.534 60.246
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 120.890937 16.722415 7.229 2.51e-12 ***
age 2.120309 0.640815 3.309 0.00102 **
I(age^2) -0.016562 0.005824 -2.844 0.00469 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.5 on 397 degrees of freedom
Multiple R-squared: 0.06014, Adjusted R-squared: 0.05541
F-statistic: 12.7 on 2 and 397 DF, p-value: 4.498e-06
It looks like this modification explains slightly more of our model's variation (Adjusted R-squared) but doesn't drastically improve on it.
We are interested in improving our model by adding an extra variable we think might have an effect on our outcome values. In the example below, we are adding the independent variables x1, x2, and each of these terms has their own linear parameter b1 and b2, respectively.
Expression:
\begin{equation*}\Large Y \sim Normal(a + b_1x_1 + b_2x_2, {\sigma^2}) \end{equation*}
This is the model we will be using next. To aid with interpretation, let's think about a parameter: b2 is the expected mean change in y per unit change in x2 when x1 is held constant (sometimes called controlling for x1).
The null hypothesis in this case is that b1 = b2 = 0. The alternative hypothesis is that at least one of these parameters is not zero.
Again, in R the intercept and coefficients are implicit in the lm function.
R code:
lm(y ~ x1 + x2)
We know that age has an effect on cholesterol. With our new model we want to ask the question: Is there a statistically significant relationship between mean serum cholesterol and age after controlling for BMI? Let's look graphically at these relationships to help us understand our model. First let's plot BMI vs cholesterol. We can add a linear fit to make sure we are expecting a positive slope.
# Plot the relationship between cholesterol and BMI
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = BMI, y = chol) + # Use BMI as the x-value
# 4. Geoms
geom_point() +
stat_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
We should also take a look at the relationship between BMI and age.
# Plot the relationship between age and BMI
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = age, y = BMI) + # Use age as the x-value, BMI as the y-value
# 4. Geoms
geom_point() +
stat_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
From our analysis, we see that cholesterol increases with BMI. Furthermore, we see that BMI tends to increase with age. We will look at the association of age and cholesterol while holding BMI constant. This will indicate whether the significance of our previous relationship between increasing cholesterol and increasing age was affected by BMI.
# Define our multiple linear regression
ageBMIFit <- lm(chol ~ age + BMI, data = cholesterol)
# summarise the model
summary(ageBMIFit)
Call:
lm(formula = chol ~ age + BMI, data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-58.994 -15.793 0.571 14.159 62.992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 137.1612 9.0061 15.230 < 2e-16 ***
age 0.2023 0.0795 2.544 0.011327 *
BMI 1.4266 0.3822 3.732 0.000217 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.34 on 397 degrees of freedom
Multiple R-squared: 0.07351, Adjusted R-squared: 0.06884
F-statistic: 15.75 on 2 and 397 DF, p-value: 2.62e-07
Our equation would now look like $y = 137.16 + 0.20age + 1.43BMI$.
The estimated increase in mean serum cholesterol after one year while holding BMI constant is 0.20 mg/dl. This increase is less than our previous value of 0.31 mg/dl. Why do the estimates differ?
Before, we were not controlling for BMI. Our estimate of the age-associated increase in mean cholesterol is now for subjects with the same BMI, not for subjects across all BMIs.
It looks like both age and BMI are significant. From the Adjusted R-squared value, it looks like we now explain nearly 7% of our variability. Still, we might want to verify whether adding BMI actually made a difference to the model.
anova()

We can compare our two models with the anova() function. The outputs of our models, ageFit and ageBMIFit, are lm objects. Given 2 lm objects, the anova() function tests the models against one another to see if their coefficients are significantly different and prints these results in an analysis of variance table.
Given a single lm object, it will test whether the model's terms are significant - we will use the function in this format later.
# Here we're using the function anova to compare the coefficients of two linear models.
# Think of this as a "secondary" use of the anova function. We are not running an ANOVA per se.
anova(ageFit, ageBMIFit)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| 1 | 398 | 187187.5 | NA | NA | NA | NA |
| 2 | 397 | 180841.7 | 1 | 6345.76 | 13.93079 | 0.0002173567 |
Our second model is significantly different from our first model. What is this significance based on?
The significance is a probability based on an F-test. While the t-test tells you if two means (from a single variable in our case) are significantly different, an F-test tells you if a group of variables is jointly significant.
The F-test compares the joint effect of all variables together; a large F-value means 'something' is significant. We can calculate the F-value ourselves from the information provided in the table.
| Model | Res.df | RSS | df | Sum of Squares | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 398 | 187187.5 | NA | NA | NA | NA |
| 1 | 397 | 180841.7 | 1 | 6345.76 | 13.93079 | 0.0002173567 |
Note that the Residual Sum of Squares (RSS) is also known as the Sum of Squared Errors (SSE). Recall that this is based on the squared difference between our data and our model estimates for the same data point, $(y - \hat{y})^2$. This represents the total variation/difference between our observed data and the model we have generated.
The residual degrees of freedom are the total sample size minus the number of parameters estimated by the model. Note that cholesterol has 400 entries, and the model with intercept and age estimates two parameters, leaving 398 residual degrees of freedom.
Let's see those model comparisons again:
| Model | Res.df | RSS | df | Sum of Squares | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 398 | 187187.5 | NA | NA | NA | NA |
| 1 | 397 | 180841.7 | 1 | 6345.76 | 13.93079 | 0.0002173567 |
# Let's prove to ourselves that the formulas work!
# (RSS values taken from the anova() table above)
rss.m0 <- 187187.5           # RSS0
rss.m1 <- 180841.7           # RSS1
rss.diff <- rss.m0 - rss.m1  # SSdelta
ms.diff <- rss.diff / 1      # MSdelta (1 df difference between the models)
ms.res <- rss.m1 / 397       # MSR1 (residual mean square of model 1)
ms.diff/ms.res # F-value
So, from the values supplied by anova(), we can confirm the correct calculation of the F-value.
Note: F-tests are not used alone because you still need a p-value to find out 'what' is significant. This p-value is part of the F-statistic and is calculated as follows:

pf(F-value, df1, df2, lower.tail = FALSE), where df1 is the difference in residual degrees of freedom between the models and df2 is the residual degrees of freedom of the larger model.

The pf() function uses an F-distribution with the supplied degrees of freedom and finds the area under the curve of the upper tail starting at the supplied F-value.
From that calculation we see that Pr(>F) is smaller than our $\alpha$ of 0.05.
Put simply, this result suggests that by increasing the complexity of our model to account for BMI, we have succeeded in improving the fit to our data; we already know from above that a higher percentage of variance is explained by ageBMIFit compared to ageFit.
# How to calculate the p-value yourself
pf(13.93079, df1=1, df2=397, lower.tail=FALSE)
# Example F-distribution and area
# 1. Data
ggplot(data.frame(F=c(0,5))) +
# 2. Aesthetics
aes(F) +
theme_bw() +
# 4. Geoms (using a stat function)
stat_function(fun = df, geom="area", fill="grey", args=list(df1=3, df2=47)) +
stat_function(fun = df, geom="area", fill="red", args=list(df1=3, df2=47), xlim=c(3.5, 5))
AIC()

The Akaike Information Criterion (AIC) is another method for model selection. The AIC balances the trade-off between the complexity of a given model and its goodness of fit (how well the model fits the data); the best model is the one that provides the maximum fit with the fewest predictors. Its interpretation is fairly straightforward: the lower the score, the better a model is relative to the other models. If many models have similarly low AICs, you should choose the one with the fewest model terms (variables or parameters).
# compare ageFit and ageBMIFit using AIC()
AIC(ageFit, ageBMIFit)
| df | AIC | |
|---|---|---|
| <dbl> | <dbl> | |
| ageFit | 3 | 3600.511 |
| ageBMIFit | 4 | 3588.716 |
Given that ageBMIFit has the lower score, we can say that it is the better of the two models evaluated.
What is meant by an interaction?
There is an interaction if the association between the response and a predictor variable changes across the range of another variable. This can be seen in the expression below, where the slope for x1 changes by b3 for each unit difference in x2 (and vice versa), because the b3 term ties the two predictors together.
In our case study, an interaction would mean that BMI (the third variable) changes the association between cholesterol and age.
Expression:
\begin{equation*}\Large Y \sim Normal(a + b_1x_1 + b_2x_2 + b_3x_1x_2, {\sigma^2}) \end{equation*}
In the graph below, there is an interaction between education and ideology. The slope, indicating the probability that people will care if sea level rises 20 feet, changes with each education level and each shift in ideology. If there were no interaction between ideology and education, the slopes shown would be parallel.
When testing for an interaction between 2 input variables, the lm() input takes an asterisk * instead of a + between the independent variables.
R code:
lm(y ~ x1 * x2)
An interaction is different from a confounding factor, for which the association between the response and predictor variable is constant across the range of the new variable. You can think of confounding factors as variables that have an effect on the outcome but haven't been accounted for.
For example, in our first model where the increase in cholesterol was ONLY due to an increase in age, BMI would be a confounding factor because weight contributes significantly to an increase in cholesterol, and age alone is not responsible for the increase in cholesterol.
Test if there is an interaction between age and BMI in a model predicting mean serum cholesterol. Note that these are both continuous variables.
# Build a new model accounting for an interaction between age and BMI
ageIntBMIFit <- lm(chol ~ age*BMI, data = cholesterol)
# summarise the model
summary(ageIntBMIFit)
Call:
lm(formula = chol ~ age * BMI, data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-59.207 -15.767 0.379 13.926 62.715
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.450e+02 3.647e+01 3.975 8.37e-05 ***
age 6.085e-02 6.461e-01 0.094 0.925
BMI 1.109e+00 1.489e+00 0.745 0.457
age:BMI 5.694e-03 2.581e-02 0.221 0.826
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.37 on 396 degrees of freedom
Multiple R-squared: 0.07362, Adjusted R-squared: 0.0666
F-statistic: 10.49 on 3 and 396 DF, p-value: 1.182e-06
# Compare to our previous models
# linear vs. interaction
anova(ageFit, ageIntBMIFit)
# multiple linear vs. interaction
anova(ageBMIFit, ageIntBMIFit)
# AIC of all three
AIC(ageFit, ageBMIFit, ageIntBMIFit)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| 1 | 398 | 187187.5 | NA | NA | NA | NA |
| 2 | 396 | 180819.5 | 2 | 6367.976 | 6.973028 | 0.001056228 |
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| 1 | 397 | 180841.7 | NA | NA | NA | NA |
| 2 | 396 | 180819.5 | 1 | 22.2158 | 0.04865327 | 0.8255372 |
| df | AIC | |
|---|---|---|
| <dbl> | <dbl> | |
| ageFit | 3 | 3600.511 |
| ageBMIFit | 4 | 3588.716 |
| ageIntBMIFit | 5 | 3590.667 |
From our analyses above, the comparisons suggest that the interaction model, ageIntBMIFit, is not significantly different from our ageBMIFit model. This is confirmed by the Akaike Information Criterion results as well.
Up to now we've been comparing the effect of age and BMI on cholesterol levels but these are continuous variables. Remember we have also included haplotypes for three genetic loci in our dataset. How do we build a model analysing the influence of these independent variables on our outcome (cholesterol)?
The suite of general linear models includes many statistical models, which we will continue to discuss in the context of our data. Until now, we have focused on the subgroup of regression models, in which all of our independent variables are quantitative. The remainder of our models, while still being general linear models, include one or more categorical independent variables.
Recall our data tracks a number of genetic variants and codes the genotype of observations for each:
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE |
|---|---|---|---|---|---|---|---|---|
| patient code | Male = 0, Female = 1 | patient age | cholesterol | body mass index | triglycerides | haplotype: C/C = 0, C/G = 1, G/G = 2 | haplotype: C/C = 0, C/T = 1, T/T = 2 | allele haplotype: e2/e2 = 1, e2/e3 = 2, e2/e4 = 3, e3/e3 = 4, e3/e4 = 5, e4/e4 = 6 |
In the analysis of variance (ANOVA) all independent variables are categorical (factors) rather than continuous. This allows us to ask the question:
Does the genetic factor rs174548 have an effect on cholesterol levels?
Our categorical example is represented by $\alpha$ and $i$ represents the levels of our factor.
Expression:
\begin{equation*}\Large Y \sim Normal({\alpha_i}, {\sigma^2}) \end{equation*}
We still use the lm() function, but we replace our continuous variable with f, a categorical variable (factor). If your data is of character type, R will automatically make a factor for you. However, if your data is numeric, R will interpret it as continuous; in this case, you need to make your numeric data a factor first using factor().
R code:
lm(y ~ f)
R parameterizes the model in terms of the differences between the first group and subsequent groups (i.e. relative to the first group) rather than in terms of the mean of each group. This is similar to the interpretation of the previous linear models. You can instead fit the means of each group using lm(y ~ f-1), similar to how we forced the y-intercept to 0.
To begin the journey in answering our question, we first plot the relationship between rs174548 and cholesterol.
# Let us look at the structure of our dataset again
str(cholesterol)
# First, take a look at the levels of the factor rs174548 using the levels() function.
# It's the equivalent of asking for unique() on a character vector
unique(cholesterol$rs174548)
# Get the levels
levels(cholesterol$rs174548)
'data.frame':	400 obs. of 10 variables:
 $ ID       : int 1 2 3 4 5 6 7 8 9 10 ...
 $ sex      : int 1 1 0 0 1 1 0 0 0 0 ...
 $ age      : int 74 51 64 34 52 39 79 38 52 58 ...
 $ chol     : int 215 204 205 182 175 176 159 169 175 189 ...
 $ BMI      : num 26.2 24.7 24.2 23.8 34.1 22.7 22.9 24.9 20.4 22 ...
 $ TG       : int 367 150 213 111 328 53 274 137 125 209 ...
 $ rs174548 : int 1 2 0 1 0 0 2 1 0 0 ...
 $ rs4775401: int 2 1 1 1 0 2 1 1 1 1 ...
 $ APOE     : int 4 4 4 1 1 4 1 1 4 5 ...
 $ age_group: num 1 0 1 0 0 0 1 0 0 1 ...
NULL
Why are the levels of our variable rs174548 set to NULL? Looking at the structure we see that this variable is listed as an int and is not yet a factor - so no levels exist.
Recall that we can coerce from integers and character variables into factors by casting them directly as we see below. We can do the same with variables, on the fly, as we plot our data into a scatterplot and boxplot.
# Make it a factor and then get its levels
levels(factor(cholesterol$rs174548))
# scatter plot our cholesterol data but colour based on rs174548
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x=age, y=chol, colour = as.factor(rs174548)) +
# Geoms
geom_point()
Hard to see the distribution of our factors right? It looks all very mixed up. Perhaps a boxplot will clarify this data?
# Make a boxplot of rs174548
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = as.factor(rs174548), y = chol) +
# 4. Geoms
geom_boxplot() +
geom_jitter()
From our preliminary plots, we see that our genetic factor has 3 groups, and we will be comparing the means for each of these groups. These groups have high variance, and there is a good deal of overlap between them.
Still, in a nutshell...
To assess whether the means are equal, the model compares the variation between group means to the variation within groups.

Recall our formulas when comparing between models with anova()? This is similar in concept, except the numerator and denominator are calculated in relation to the means of each factor level. The larger the MSR compared to the MSE, the more support there is for a difference between the population means.
So how do we put this to practical use in R?
Now that we know more about how a one-way ANOVA is calculated, let's talk about how to write the expression to model our categorical variable. The form of this equation should look more familiar.
Expression:
\begin{equation*}\Large Y \sim Normal({\beta_0 + \beta_1x_1 + \beta_2x_2}, {\sigma^2}) \end{equation*}
The interpretation of this model is a bit trickier.
Alternatively,
So you can think of each of these groups having their own means, i.e. $\mu_0 = \beta_0$, $\mu_1 = \beta_0 + \beta_1$, $\mu_2 = \beta_0 + \beta_2$. We are testing the hypothesis that not all of these means are equal.
Before jumping into the ANOVA, we should consider some ways to help make our analysis simpler. For instance, we can encode our categorical variable as a dummy or treatment variable. In our data frame for the rs174548 variable, 0 stands for the genotype C/C, 1 is C/G and 2 is G/G. Therefore, there are $k$ factor levels to our categorical variable. Using this information, we can create a matrix with $k-1$ separate columns of 0s and 1s to represent our levels and we will contrast each to the reference level (C/C). A genotype will be compared to the reference genotype if it has a 1, and so all comparisons to the reference are included in the matrix.
| rs174548 | x1 | x2 |
|---|---|---|
| C/C | 0 | 0 |
| C/G | 1 | 0 |
| G/G | 0 | 1 |
Luckily for us, R will automatically create dummy variables in the background if you state you have a categorical variable.
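You can peek at the dummy coding R builds behind the scenes with model.matrix(). Here is a sketch on a toy factor reusing our three genotype labels; with the real data, model.matrix(~ as.factor(rs174548), data = cholesterol) would work the same way:

```r
# Toy factor with the three genotype labels, invented for illustration
f <- factor(c("C/C", "C/G", "G/G", "C/G"))

# One intercept column (the reference level C/C) plus one 0/1 dummy
# column for each of the k - 1 remaining levels
model.matrix(~ f)
```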
Recall that ANOVA is also based on testing hypotheses. It will use the F-value and degrees of freedom from our model to ascertain which of the following hypotheses we should accept/reject.
$H_0$ (null hypothesis): there is no difference between group means
$H_a$ (alternate hypothesis): there is a difference between group means
Also, there are technically 2 ways for us to do this analysis!
lm() and summary()

As before, we can build our linear model with lm() and then generate summary statistics for it. This will give us some helpful information for understanding and interpreting the model.
# Create a one-way anova model by converting rs174548 to a factor
# Note that R will create the dummy variable for us
rs174548Fit <- lm(chol ~ as.factor(rs174548), data = cholesterol) # C/C, represented by 0, is our reference variable (intercept)
# Retrieve a summary of our model
summary(rs174548Fit)
Call:
lm(formula = chol ~ as.factor(rs174548), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-64.062 -15.913 -0.062 14.938 59.136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 181.062 1.455 124.411 < 2e-16 ***
as.factor(rs174548)1 6.802 2.321 2.930 0.00358 **
as.factor(rs174548)2 5.438 4.540 1.198 0.23167
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.93 on 397 degrees of freedom
Multiple R-squared: 0.0221, Adjusted R-squared: 0.01718
F-statistic: 4.487 on 2 and 397 DF, p-value: 0.01184
Under the Coefficients table we see the individual p-values for observing the cholesterol values of a specific haplotype given that $H_0$ is true. Haplotype 0 is folded into the intercept.
Alternatively,
The summary tells us that the genetic factor rs174548 has an effect on cholesterol at a significance level of p = 0.01184 (< 0.05). Looking closely at the Adjusted R-squared value, we also see that this independent variable accounts for only about 1.7% of the variability in our cholesterol data.
An analysis of variance table from calling anova() with one model as input gives us the same F-value and p-value.
# anova() returns only significant differences in the model
anova(rs174548Fit)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| <int> | <dbl> | <dbl> | <dbl> | <dbl> | |
| as.factor(rs174548) | 2 | 4314.195 | 2157.0973 | 4.486538 | 0.01183627 |
| Residuals | 397 | 190874.915 | 480.7932 | NA | NA |
As you can see, we were able to retrieve the same information about the overall model showing that rs174548 genotype is a contributing factor to cholesterol values.
aov() and summary() to analyse our model

With the simple models we are generating, there is little advantage to using aov() over lm(). In fact, aov() calls the lm() function behind the scenes. However, for more complicated models involving multiple factors and interactions, the summary output of aov() is better, and it can also support error terms in your model.
So, leave this as an additional option to explore when generating more complex models.
# An equivalent command would be using aov()
summary(aov(chol ~ as.factor(rs174548), data = cholesterol))
                     Df Sum Sq Mean Sq F value Pr(>F)
as.factor(rs174548)   2   4314  2157.1   4.487 0.0118 *
Residuals           397 190875   480.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that the p-value for the whole set of comparisons is 0.01184, tested against our $H_0$ that the predictor, rs174548, has no relationship with the response variable, cholesterol. We must consider what this really means for the model we gave to lm.
All of this analysis tells us that there is a difference in means (rejects the null hypothesis that all means are the same), but it does not tell us which population/group means are significantly different from one another.
In order to look at this we need to look at multiple pairwise comparisons of
$$\mu_0 = \mu_1,\;\;\; \mu_0 = \mu_2,\;\;\; \mu_1 = \mu_2$$

Post hoc, Latin for "after this", refers to analyses done after an ANOVA with the objective of identifying which means, if any, show significant differences.
Multiple comparisons increase the family-wise error rate (FWER) - the probability of making a false discovery (a.k.a. a false positive or Type I error). This is where multiple test corrections come in, to control the error at a specific threshold, i.e. $\alpha = 0.05$ or 5%. One of the simplest and most conservative is the Bonferroni correction (testing each comparison at $\alpha/k$, or equivalently multiplying each p-value by $k$).
Remember: $k$ is the number of factor levels in our categorical variable. ie $k = 3$ for rs174585
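As a quick sketch of how these corrections behave, base R's p.adjust() can be applied to any vector of raw p-values - the numbers below are made up for illustration:

```r
# Three made-up raw p-values, one per pairwise comparison (k = 3)
praw <- c(0.004, 0.23, 0.77)

# Bonferroni: multiply each p-value by k (capped at 1)
p.adjust(praw, method = "bonferroni")

# Benjamini-Hochberg: a gentler correction controlling the false discovery rate
p.adjust(praw, method = "BH")
```

Note how Bonferroni pushes the largest p-value all the way to 1, while BH adjusts less aggressively.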
To run multiple tests to see if group means differ, we can use general linear hypothesis testing, which takes two objects: a fitted model and a contrast matrix.
We'll start by remaking our model with a slight change to obtain the absolute mean of each group rather than the relative mean. Previously, it was mentioned that you can fit the means for each group using lm(y ~ f-1).
# test the fit of our model
rs174548TestFit <- lm(chol ~ as.factor(rs174548) - 1, data = cholesterol) # -1 drops the intercept
summary(rs174548TestFit)
Call:
lm(formula = chol ~ as.factor(rs174548) - 1, data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-64.062 -15.913 -0.062 14.938 59.136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
as.factor(rs174548)0 181.062 1.455 124.41 <2e-16 ***
as.factor(rs174548)1 187.864 1.809 103.88 <2e-16 ***
as.factor(rs174548)2 186.500 4.300 43.37 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.93 on 397 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.986
F-statistic: 9383 on 3 and 397 DF, p-value: < 2.2e-16
contrMat()

For the second object needed in our group comparisons, we want to make a contrast matrix. The simplest contrast matrix is a matrix of 0s, 1s, and -1s, where -1 and 1 mark the two factor levels whose difference you want to test. Notice that each row sums to 0.
# A contrMat is a one degree of freedom test comparing means.
# It can compare more than two means if you group them in such way that at the end,
# there are only two groups being compared: average of the means of groups 1 and 2 versus the mean of group 3
M <- contrMat(table(cholesterol$rs174548), type = "Tukey")
# Tukey is a single-step multiple pairwise comparison test to find differences in means
M
# Notice the attributes of our Contrast matrix
print("Structure of M")
str(M)
| 0 | 1 | 2 | |
|---|---|---|---|
| 1 - 0 | -1 | 1 | 0 |
| 2 - 0 | -1 | 0 | 1 |
| 2 - 1 | 0 | -1 | 1 |
[1] "Structure of M"
 'contrMat' num [1:3, 1:3] -1 -1 0 1 0 -1 0 1 1
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "1 - 0" "2 - 0" "2 - 1"
  ..$ : chr [1:3] "0" "1" "2"
 - attr(*, "type")= chr "Tukey"
glht()

The rownames of the contrast matrix tell us what is being compared ([1-0], [2-0], [2-1]). For example, [1-0] is the difference between C/C (-1) and C/G (1).
More complicated comparisons can be made. For example, the difference between C/C and the average of G/G and C/G could be specified by adding a row to the matrix of -2 1 1 (Note: rows of a contrast matrix must add to zero).
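As a base-R sketch (the variable name M2 here is illustrative, not from the course code), the Tukey contrast matrix can be written out by hand and extended with such a custom row:

```r
# Pairwise (Tukey-style) contrasts for the three rs174548 genotypes
M2 <- rbind("1 - 0" = c(-1,  1, 0),
            "2 - 0" = c(-1,  0, 1),
            "2 - 1" = c( 0, -1, 1))
colnames(M2) <- c("0", "1", "2")

# Custom comparison: C/C versus the average of C/G and G/G
M2 <- rbind(M2, "avg(1,2) - 0" = c(-2, 1, 1))

# Every row of a valid contrast matrix sums to zero
rowSums(M2)
```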
To get estimates using general linear hypothesis testing, we use the aptly named glht() function with our linear hypotheses to be compared as specified by our contrast matrix. The function will take in our ANOVA model and produce the summary information used to compare each combination of model variables put forth in our contrast matrix. Overall this takes the work out of creating our own series of comparisons.
We will first look at a summary without adjusting/correcting our p-values.
# library(multcomp)
# multiple comparisons made with our parametric model
mc <- glht(rs174548TestFit, linfct = M)
# Retrieve the summary of our glht object
summary(mc, test = adjusted("none")) # test is a function for computing p values.
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = chol ~ as.factor(rs174548) - 1, data = cholesterol)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 - 0 == 0 6.802 2.321 2.930 0.00358 **
2 - 0 == 0 5.438 4.540 1.198 0.23167
2 - 1 == 0 -1.364 4.665 -0.292 0.77015
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- none method)
[1 - 0 == 0] The difference in means between C/C and C/G is 6.80 mg/dl and this difference is significant.
[2 - 0 == 0] The difference in means between C/C and G/G is 5.44 mg/dl and this difference is not significant.
[2 - 1 == 0] The difference in means between C/G and G/G is -1.36 mg/dl and this difference is not significant.
You may notice that some of these p-values look familiar from our original summary() of the model. We have, however, added the comparison of the C/G to G/G genotypes.
Don't forget we've tested 3 different pairings and need to account for the possibility of false positives.

Remember: if we have set an $\alpha$ level of 0.05, then we expect to reject $H_0$ by chance about once in every 20 tests.
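That intuition can be made concrete with one line of arithmetic: with $k$ independent tests at level $\alpha$, the chance of at least one false positive is $1 - (1 - \alpha)^k$.

```r
# Family-wise error rate for k = 3 independent tests at alpha = 0.05
alpha <- 0.05
k <- 3
fwer <- 1 - (1 - alpha)^k
fwer  # ~0.14, nearly triple the per-test alpha
```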
Let's check if multiple test correction affects these relationships.
# Adjust the p-values by Bonferroni
# Bonferroni corrects by testing each individual hypothesis at a significance level of α/m,
# where α is the desired overall alpha level and m is the number of hypotheses
summary(mc, test = adjusted("bonferroni"))
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = chol ~ as.factor(rs174548) - 1, data = cholesterol)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 - 0 == 0 6.802 2.321 2.930 0.0107 *
2 - 0 == 0 5.438 4.540 1.198 0.6950
2 - 1 == 0 -1.364 4.665 -0.292 1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- bonferroni method)
# Control for the false discovery rate (FDR)
summary(mc, test = adjusted("BH")) # Benjamini & Hochberg
summary(mc, test = adjusted("fdr")) # alias for BH
#https://web.mit.edu/r/current/arch/i386_linux26/lib/R/library/multtest/html/mt.rawp2adjp.html
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = chol ~ as.factor(rs174548) - 1, data = cholesterol)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 - 0 == 0 6.802 2.321 2.930 0.0107 *
2 - 0 == 0 5.438 4.540 1.198 0.3475
2 - 1 == 0 -1.364 4.665 -0.292 0.7702
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- BH method)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: lm(formula = chol ~ as.factor(rs174548) - 1, data = cholesterol)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 - 0 == 0 6.802 2.321 2.930 0.0107 *
2 - 0 == 0 5.438 4.540 1.198 0.3475
2 - 1 == 0 -1.364 4.665 -0.292 0.7702
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- fdr method)
The significant difference in mean cholesterol between C/C and C/G genotypes of rs174548 holds under different multiple test corrections.
What if we use two or more categorical variables (factors) to model our outcome? We can now ask the question:
Does the effect of the genetic factor rs174548 differ between males and females?
We need to test whether there is an effect of our factors on cholesterol and also if there is an interaction between these factors.
Expression:
\begin{equation*}\Large Y \sim Normal({\alpha_i} + {\beta_j}, {\sigma^2}) \end{equation*}

$\alpha$ and $\beta$ are the effects of our two categorical variables, where $i$ is the level of the first group and $j$ is the level of the second group. As with one-way ANOVA, R models our categorical variables as factors.
R code:
lm(y ~ factor1 + factor2): testing for main effects without interaction.
lm(y ~ factor1 * factor2): testing for the main effects with interaction.
The following diagram will help us visualize the differences in coefficients with and without interaction between 2 categorical variables. In this first scenario, the difference in the means between groups defined by factor B does not depend on the level of factor A and vice versa. This means that there is no interaction, and the lines between the factor groups are parallel. In the second scenario the difference in the means between groups defined by factor B changes when A2 is present. There is an interaction and the lines are not parallel.
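Since the diagram itself is not reproduced here, a quick sketch with made-up cell means (not the course data) shows the same idea numerically: with no interaction, the B2 - B1 difference is identical at every level of A (parallel lines); with an interaction, it is not.

```r
# Made-up group means for two factors, A (rows) and B (columns)
no_int <- matrix(c(10, 15, 20, 25), nrow = 2,
                 dimnames = list(c("A1", "A2"), c("B1", "B2")))
with_int <- matrix(c(10, 15, 20, 12), nrow = 2,
                   dimnames = list(c("A1", "A2"), c("B1", "B2")))

# No interaction: B2 - B1 is the same at both levels of A (parallel lines)
no_int[, "B2"] - no_int[, "B1"]

# Interaction: B2 - B1 changes with the level of A (non-parallel lines)
with_int[, "B2"] - with_int[, "B1"]

# matplot(t(no_int), type = "b") would draw the parallel-lines scenario
```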
lm() or aov()

Let's begin looking at our dependent variable cholesterol by assuming there is no interaction between sex and the allele rs174548. We have two options to consider at this point: do we use lm() or aov()? Let's compare the summary output from each.
# Two-way factor model with no interaction
sex_rs174548Fit <- lm(chol ~ as.factor(rs174548) + as.factor(sex), data = cholesterol)
summary(sex_rs174548Fit)
# We have two rs174548 coefficients, 1 and 2, because that factor has 3 levels. Sex only has two, so only 1 comparison is shown
Call:
lm(formula = chol ~ as.factor(rs174548) + as.factor(sex), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-66.653 -14.463 -0.601 15.445 57.635
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 175.365 1.786 98.208 < 2e-16 ***
as.factor(rs174548)1 7.236 2.250 3.215 0.00141 **
as.factor(rs174548)2 5.184 4.398 1.179 0.23928
as.factor(sex)1 11.053 2.126 5.199 3.22e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.24 on 396 degrees of freedom
Multiple R-squared: 0.08458, Adjusted R-squared: 0.07764
F-statistic: 12.2 on 3 and 396 DF, p-value: 1.196e-07
# Use aov to generate our linear model
sex_rs174548Fit_aov <- aov(chol ~ as.factor(sex) + as.factor(rs174548), data = cholesterol)
# Look at a summary of the model
summary(sex_rs174548Fit_aov)
                     Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(sex)        1  11709   11709  25.951 5.43e-07 ***
as.factor(rs174548)   2   4799    2400   5.318  0.00526 **
Residuals           396 178681     451
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm() t-test summaries versus aov() ANOVA tables

Looking closely, we see two different outputs: the summary() of each object produces a slightly different format and result. The lm() summary has provided a series of t-tests looking at the levels of each factor and how each level influences the value of cholesterol with respect to the baseline mean. Each level is accompanied by a corresponding p-value, as we've seen, and overall the model reports an adjusted R-squared value.
The aov() summary, on the other hand, has less but also different information. Instead, it gives the p-value of each factor as a whole, which tells us whether the term itself is significant to our model. If you were working with many independent variables, you may be more concerned with discerning which ones actually influence the overall variance in your dataset. Note also the p-values of the sex variable in our two models. Why are they different?
Remember that both of these functions produce an object that contains all of the information for our model. What we need, however, is a way to summarize that information correctly.
Anova() to interpret your results as an ANOVA table

Regardless of which function you use to generate your models, you can produce summary ANOVA tables for each model using the car::Anova() function. Yes, R does include an anova() function in the base package too! Just be careful which one you decide to use: the former can compute Type-II/III sums of squares, while the latter uses sequential Type-I sums of squares.
# Generate an ANOVA table for both models
anova(sex_rs174548Fit_aov) # sequential (Type-I) sums of squares
Anova(sex_rs174548Fit_aov, type = "III") # Type-III sums of squares
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| <int> | <dbl> | <dbl> | <dbl> | <dbl> | |
| as.factor(sex) | 1 | 11709.372 | 11709.3722 | 25.950839 | 5.430902e-07 |
| as.factor(rs174548) | 2 | 4799.139 | 2399.5697 | 5.318035 | 5.258707e-03 |
| Residuals | 396 | 178680.598 | 451.2136 | NA | NA |
| Sum Sq | Df | F value | Pr(>F) | |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | |
| as.factor(sex) | 12194.317 | 1 | 27.025595 | 3.224243e-07 |
| as.factor(rs174548) | 4799.139 | 2 | 5.318035 | 5.258707e-03 |
| Residuals | 178680.598 | 396 | NA | NA |
As you can see, the two ANOVA methods give slightly different sums of squares. This also results in different p-values for the independent variable sex. Again, if we were working with many more independent variables, we might see more of an impact between the two SS types as well.
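The order-dependence of sequential (Type-I) sums of squares is easy to demonstrate on a small unbalanced toy dataset (entirely made up, not the course data): the SS attributed to a factor changes when the factors enter the model in a different order.

```r
# Unbalanced two-factor toy data: the factors are not orthogonal
f1 <- factor(c("a", "a", "a", "a", "b", "b"))
f2 <- factor(c("x", "x", "x", "y", "x", "y"))
y  <- c(1.2, 0.8, 1.0, 2.1, 2.9, 4.0)

# Sequential SS for f1 depends on whether it is fitted first or second
ss_first  <- anova(lm(y ~ f1 + f2))["f1", "Sum Sq"]
ss_second <- anova(lm(y ~ f2 + f1))["f1", "Sum Sq"]
c(first = ss_first, second = ss_second)  # the two values differ
```

Type-II/III sums of squares (via car::Anova()) do not depend on the order in which the terms are written.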
Going back to the results of our lm() call:
- The estimated mean cholesterol for males in the C/C group is the intercept, 175.36 mg/dl.
- The estimated difference in mean cholesterol between females and males, controlled for genotype, is 11.05 mg/dl.
- The estimated difference in means between the C/G and C/C groups, controlled for sex, is 7.24 mg/dl.
- The estimated difference in means between the G/G and C/C groups, controlled for sex, is 5.18 mg/dl.
There is evidence cholesterol level is associated with our rs174548 locus and sex (p < 0.001). This combined model also appears to explain more variability in the data than with just rs174548 (7.7% vs. 1.7%).
How does this compare to the model with sex alone as a predictor? Recall that in the case of comparing models, we should use the anova() function.
# Build a linear model based on the "sex" variable
sexFit <- lm(chol ~ as.factor(sex), data = cholesterol)
summary(sexFit)
Call:
lm(formula = chol ~ as.factor(sex), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-62.299 -14.343 -0.888 14.701 57.701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 178.477 1.522 117.26 < 2e-16 ***
as.factor(sex)1 10.821 2.147 5.04 7.09e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.47 on 398 degrees of freedom
Multiple R-squared: 0.05999, Adjusted R-squared: 0.05763
F-statistic: 25.4 on 1 and 398 DF, p-value: 7.087e-07
# Compare the two models with anova(), NOT Anova()
anova(sexFit, sex_rs174548Fit)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| 1 | 398 | 183479.7 | NA | NA | NA | NA |
| 2 | 396 | 178680.6 | 2 | 4799.139 | 5.318035 | 0.005258707 |
In conclusion, our combined non-interacting model explains more variability in our response variable than looking at sex or rs174548 alone. While sex alone explains a good deal of variability, the combined model of both variables is still a significant improvement.
While we have concluded there is a difference between our two main ANOVA models, we have to address the possibility that there may be an interaction between these two terms in our model. Therefore we should now check the two-way ANOVA model with the interaction.
# Build an interaction model with two factors with aov()
sexInt_rs174548Fit_aov <- aov(chol ~ as.factor(sex) * as.factor(rs174548), data = cholesterol)
summary(sexInt_rs174548Fit_aov)
# Look at the Type-III SS summary (note: car::Anova's argument is lowercase `type`)
Anova(sexInt_rs174548Fit_aov, type = "III")
                                    Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(sex)                       1  11709   11709  26.378 4.42e-07 ***
as.factor(rs174548)                  2   4799    2400   5.405  0.00483 **
as.factor(sex):as.factor(rs174548)   2   3779    1889   4.256  0.01483 *
Residuals                          394 174902     444
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Sum Sq | Df | F value | Pr(>F) | |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | |
| as.factor(sex) | 12194.317 | 1 | 27.470072 | 2.606525e-07 |
| as.factor(rs174548) | 4799.139 | 2 | 5.405498 | 4.831173e-03 |
| as.factor(sex):as.factor(rs174548) | 3778.947 | 2 | 4.256407 | 1.483039e-02 |
| Residuals | 174901.651 | 394 | NA | NA |
aov model

According to our Anova() table of the model, it does appear that there is a statistically significant interaction between sex and rs174548. If we are interested in the specific interaction, we can take a closer look with lm() in this case, since we are only using a single interaction term in our model.
# Build an interaction model with two factors with lm()
sexInt_rs174548Fit <- lm(chol ~ as.factor(sex) * as.factor(rs174548), data = cholesterol)
summary(sexInt_rs174548Fit)
Call:
lm(formula = chol ~ as.factor(sex) * as.factor(rs174548), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-70.529 -13.604 -0.974 14.171 54.882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 178.1182 2.0089 88.666 < 2e-16 ***
as.factor(sex)1 5.7109 2.7982 2.041 0.04192 *
as.factor(rs174548)1 0.9597 3.1306 0.307 0.75933
as.factor(rs174548)2 -0.2015 6.4053 -0.031 0.97492
as.factor(sex)1:as.factor(rs174548)1 12.7398 4.4650 2.853 0.00456 **
as.factor(sex)1:as.factor(rs174548)2 10.2296 8.7482 1.169 0.24297
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.07 on 394 degrees of freedom
Multiple R-squared: 0.1039, Adjusted R-squared: 0.09257
F-statistic: 9.14 on 5 and 394 DF, p-value: 3.062e-08
- The estimated mean cholesterol for males in the C/C group is 178.12 mg/dl.
- The estimated mean cholesterol for females in the C/C group is (178.12 + 5.71) mg/dl.
- The estimated mean cholesterol for males in the C/G group is (178.12 + 0.96) mg/dl.
- The estimated mean cholesterol for females in the C/G group is (178.12 + 5.71 + 0.96 + 12.74) mg/dl.
There appears to be a significant interaction between being female and having the C/G genotype (p<0.01).
Let's compare the 2 two-way ANOVA models we've generated.
# anova to compare the 2 models of two-way ANOVA
...(sex_rs174548Fit, sexInt_rs174548Fit_aov)
There is evidence that these two models are different (p = 0.01483). Let's compare all three models as a last step.
# Use AIC to help compare all three models we've generated
...(sexFit, sex_rs174548Fit, sexInt_rs174548Fit)
TukeyHSD()

We've generated 3 models to explain our cholesterol values in terms of sex and the genetic locus rs174548.
| Independent variables | Model | term relationship | Adjusted R-squared | F-value | p-value | AIC |
|---|---|---|---|---|---|---|
| sex | One-way ANOVA | NA | 0.05763 | 25.4 | 7.087e-07 | 3592.509 |
| sex, rs174548 | Two-way ANOVA | additive | 0.07764 | 12.2 | 1.196e-07 | 3585.907 |
| sex, rs174548 | Two-way ANOVA | interacting | 0.09257 | 9.14 | 3.062e-08 | 3581.357 |
It looks like our third model that accounts for interaction between sex and rs174548 explains the most variation. To investigate the story further we would want to look at which interactions are possibly the most significant.
Since we already have an ANOVA model and wish to make multiple comparisons, we can use the TukeyHSD() function (Tukey's Honestly Significant Differences) to identify which levels may interact across the factors in our model. This function works best on a balanced design but can adjust for mildly unbalanced designs as well.
The procedure creates a critical value based on your design and desired $\alpha$ level so the result already takes into account the multiple comparisons generated to provide an adjusted p-value. To recap, we'll need:
an aov model object.

# Run TukeyHSD on our aov model
...(sexInt_rs174548Fit_aov)
TukeyHSD() output

From above we can see our different groups compared to each other, with p-values adjusted for multiple comparisons. Looking directly at the interaction portion of the analysis, we can ascertain that there are 3 highly significant interacting groups, and one additional group with significant results.
| Group 1 | Group 2 | Difference in means | Adjusted p-value |
|---|---|---|---|
| Females: C/G | Males: C/C | 19.41 mg/dl | 0.0000001 |
| Females: C/G | Females: C/C | 13.69 mg/dl | 0.0003054 |
| Females: C/G | Males: C/G | 18.45 mg/dl | 0.0000028 |
| Females: C/G | Males: G/G | 19.61 mg/dl | 0.0360298 |
It would appear that females with the C/G genotype may have a significantly increased cholesterol value across all of our samples. This would, however, require deeper analysis to confirm.
Note that if we had used TukeyHSD() on an additive model of our categorical variables, we would get a similar looking output measuring the differences between different factor levels but it would not include the a:b grouped comparisons.
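A minimal sketch of that difference, on simulated data rather than the course dataset (the factor names f1/f2 are illustrative stand-ins for sex and genotype): TukeyHSD() on an additive aov model returns one comparison table per factor, with no interaction block.

```r
set.seed(42)
# Simulated stand-ins for a two-level and a three-level factor
d <- data.frame(f1 = factor(rep(c("m", "f"), each = 30)),
                f2 = factor(rep(c("cc", "cg", "gg"), times = 20)))
d$y <- rnorm(60, mean = 5 + 2 * (d$f1 == "f") + (d$f2 == "cg"))

# Additive model: no f1:f2 term, so no grouped comparisons in the output
tk <- TukeyHSD(aov(y ~ f1 + f2, data = d))
names(tk)  # only per-factor tables; an interaction model would add "f1:f2"
```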
The analysis of covariance (ANCOVA) model allows us to compare the means of a dependent variable between two or more groups while taking into account variability of other variables (covariates). ANCOVA, therefore, uses a continuous variable as a linear regression component (covariate). In simpler terms, the ANCOVA builds an ANOVA model to explain the between-group variance but incorporates the covariates to help explain the within-group variance.
This allows us to ask the question:
Is the relationship between age and cholesterol affected by sex?
Our model won't look that different from our other equations except that we have categorical and continuous predictor variables.
Expression:
\begin{equation*}\Large Y \sim Normal({\alpha_i} + {\beta_i}x, {\sigma^2}) \end{equation*}

Parameters are: the intercept of the first factor level, the slope with respect to $x$ for the first factor level, the differences in intercepts for each factor level other than the first, and the differences in slopes for each factor level other than the first.
R code:
lm(y ~ f + x), testing for main effects without interaction.
lm(y ~ f*x), testing for main effects with interaction.
To answer our question, let's first take a quick look at sex differences in cholesterol in our dataset, keeping in mind that males are encoded as 0 and females as 1. Based on sex information alone, we see that women have a higher mean serum cholesterol, but we don't know if this is significant.
# Make a boxplot of cholesterol grouped by sex
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = as.factor(sex), y = chol) +
# 4. Geoms
geom_boxplot() +
geom_jitter()
Before moving on, there are some conditions we should look into before beginning our ANCOVA:
Linearity between the covariate and the outcome variable at each level of the grouping variable. In this case we are referring to cholesterol (outcome) versus age (covariate) when grouped by sex.
Homogeneity of regression slopes. This assumes no interaction between the independent variable and covariate. In other words, the b coefficients must be equal amongst all sub-groups.
The residuals of the outcome variable should be approximately normally distributed.
Homoscedasticity. Remember we want homogeneous residual variance for all groups.
No significant outliers in the group.
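A sketch of how conditions 3 and 4 might be checked in base R, shown on simulated data (the real check would use the cholesterol dataset and your fitted model; the variable names here are illustrative):

```r
set.seed(7)
# Simulated stand-in data: a binary group and a continuous covariate
d <- data.frame(grp = factor(rep(0:1, each = 40)),
                age = runif(80, 30, 70))
d$chol <- 150 + 0.3 * d$age + 10 * (d$grp == 1) + rnorm(80, sd = 15)
fit <- lm(chol ~ grp + age, data = d)

# Condition 3: residuals approximately normal (Shapiro-Wilk test)
shapiro.test(residuals(fit))

# Condition 4: homogeneous residual variance across groups (Bartlett's test;
# car::leveneTest() is a common, more robust alternative)
bartlett.test(residuals(fit) ~ d$grp)
```

A large p-value in either test means we have no evidence against the corresponding assumption.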
We've been looking at this data quite a bit, but let's remind ourselves that there remains linearity between age and cholesterol when grouping by sex. We can generate a scatterplot with regression lines to help us visually assess the situation.
# Plot age vs cholesterol and group by sex
ggplot(cholesterol) +
aes(x = age, y= chol, color = factor(sex)) +
geom_point() +
stat_smooth(method = "lm")
From above, our visual inspection of the scatterplot suggests that we still see a trend between increasing age and cholesterol even when grouping by sex. Sex doesn't change the relationship between age and cholesterol, as these lines are almost parallel, i.e. no interaction. We can also see that females, on average, appear to have a higher mean cholesterol than males with increasing age.
For simplicity, we can build a model based on an interaction between sex and age using aov() and look at the results.
# check for homogeneity with aov() and interaction between sex and age
summary(aov(chol ~ ..., data = cholesterol))
From above, we see that both sex and age appear to have a significant effect on cholesterol values, as we have already seen in our previous models. We can also see from the interaction term "sex:age" that the F-value is low and the associated p-value is high, so no significant interaction occurs between the two variables.
From here, we'll assume the remaining requirements 3-5 have been met. Let's take a look at the model itself.
# Make a multiple linear regression with a continuous and categorical variable, without interaction
ageSexFit <- lm(chol ~ ..., data = cholesterol)
# Check the results of our model
summary(ageSexFit)
Controlling for sex, mean cholesterol increases by 0.29697 mg/dl for an additional year of age. This is close to the slope for our model of cholesterol alone, 0.31 mg/dl. This does not necessarily mean that the age/cholesterol relationship is the same in males and females; we need to check out the interaction term. There appears to be an increase in mean serum cholesterol of 10.5 mg/dl in females over males.
Let's build an lm model that includes interaction between age and sex.
# Build a new interaction model between cholesterol and age
ageIntSexFit <- lm(chol ~ ..., data = cholesterol)
summary(ageIntSexFit)
Males are coded as 0 and females are coded as 1 in this model.
The intercept term is the mean serum cholesterol for MALES at age 0 (160.31 mg/dl). The slope term for age is the difference in mean cholesterol associated with a one year change in age for MALES. The slope for sex (14.56 mg/dl) is the difference in mean cholesterol between males and females at age 0. The interaction term is the difference in the change in mean cholesterol associated with each one year change in age for females compared to males. Sex exerts a small and not statistically significant effect on the age/cholesterol relationship.
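To make the coefficient bookkeeping concrete, here is a sketch on simulated data (assuming an interaction formula of the form chol ~ age * sex; all numbers are made up): the group-specific intercepts and slopes can be recovered from coef() and checked against predict().

```r
set.seed(3)
d <- data.frame(sex = factor(rep(0:1, each = 50)),  # 0 = male, 1 = female
                age = runif(100, 30, 70))
d$chol <- 160 + 0.33 * d$age + 14 * (d$sex == 1) -
  0.07 * d$age * (d$sex == 1) + rnorm(100, sd = 20)
fit <- lm(chol ~ age * sex, data = d)

co <- coef(fit)
male_intercept   <- co["(Intercept)"]
female_intercept <- co["(Intercept)"] + co["sex1"]      # add the sex offset
male_slope       <- co["age"]
female_slope     <- co["age"] + co["age:sex1"]          # add the interaction

# Rebuilding a prediction by hand matches predict() on the same model
hand <- unname(female_intercept + 50 * female_slope)
auto <- unname(predict(fit, newdata = data.frame(age = 50,
                                                 sex = factor(1, levels = 0:1))))
c(hand = hand, auto = auto)
```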
Let's compare the results of our models with and without an interaction term.
| ANCOVA Model | Male Intercept | Male slope | Female Intercept | Female Slope |
|---|---|---|---|---|
| No Interaction | 162.35 | 0.29 | 172.85 | 0.29 |
| With Interaction | 160.31 | 0.33 | 174.87 | 0.26 |
They seem quite similar and we also already know from testing for our conditions, that there is no significant interaction between age and sex. Still, let's confirm by comparing the two models directly.
# anova comparison of the two interaction models
anova(ageSexFit, ageIntSexFit)
# AIC comparison of the two interaction models
AIC(ageSexFit, ageIntSexFit)
In conclusion, adding the interaction term did not change the model significantly. Does our ANCOVA model ageSexFit, grouping cholesterol values by sex and using age as a covariate, improve on our one-way ANOVA model sexFit, where we predict cholesterol using only sex?
# Compare the two models with anova
anova(sexFit, ageSexFit)
# Use AIC to compare the models
AIC(sexFit, ageSexFit)
Indeed the addition of our age covariate does result in a very different model from predicting by sex alone. Furthermore, the addition of age improves on the model based on a number of indicators including the AIC comparison of the models.
Before we move on, I want to take a step back and quickly review the models and code we've gone through today. Firstly, with our example dataset, and then more generally. I hope you can see that although conceptually different, getting a handle on the code isn't too bad.
For all of these models we are trying to determine the effect of different variables on cholesterol. The differences are whether we are using:
- categorical and/or continuous input variables;
- an interaction (*) between our input variables.

| model | categorical | continuous | R code |
|---|---|---|---|
| simple linear regression | X | $\checkmark$ | lm(chol ~ age) |
| multiple linear regression | X | $\checkmark\checkmark$ | lm(chol ~ age + BMI) lm(chol ~ age*BMI) |
| one-way analysis of variance (ANOVA) | $\checkmark$ | X | lm(chol ~ as.factor(rs174548)) |
| multi-way analysis of variance (ANOVA) | $\checkmark\checkmark$ | X | lm(chol ~ as.factor(sex) + as.factor(rs174548)) lm(chol ~ as.factor(sex)*as.factor(rs174548)) aov(chol ~ as.factor(sex) + as.factor(rs174548)) aov(chol ~ as.factor(sex)*as.factor(rs174548)) |
| analysis of covariance (ANCOVA) | $\checkmark$ | $\checkmark$ | lm(chol ~ as.factor(sex) + age) lm(chol ~ as.factor(sex)*age) |
Remember that you can produce ANOVA tables with Type-III sums of squares using the car::Anova() function. We have started with models that assume normally distributed errors, but there are models left unexplored in this lecture.
In the table below, our R code for each of the models has been generalized. Here, $y$ is our response variable, $x$ is a continuous variable, and $f$ is a categorical variable (assumed to be a factor already).
| model | R code |
|---|---|
| simple linear regression | lm(y ~ x) |
| multiple linear regression | lm(y ~ x + I(x^2)) lm(y ~ x1 + x2) lm(y ~ x1*x2) |
| one-way analysis of variance (ANOVA) | lm(y ~ f) |
| multi-way analysis of variance (ANOVA) | lm(y ~ f1 + f2) lm(y ~ f1*f2) |
| analysis of covariance (ANCOVA) | lm(y ~ f + x) lm(y ~ f*x) |
You need not memorize any of these charts - you may just want to use them to orient yourself in the future. Much of the R code seems the same whether you are doing multiple linear regression, ANOVA or ANCOVA, so it is good to have a reference point.
A non-exhaustive but fair summary of what we've discussed - it applies to the small corner of linear regression we've covered today.
rs174548 and age from our data

Does the effect of the genetic factor rs174548 differ depending on a subject's age?
This sounds like we want to use ANCOVA, since we have a categorical (rs174548) and a continuous (age) variable. Let's break the process into steps again:
Make a plot of age versus cholesterol and color points by genotype. Add a linear model to the plot. Are you expecting an interaction based on this plot?
# Plot age vs cholesterol and colour by genotype
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(age, chol, color = as.factor(rs174548)) +
# 4. Geoms
geom_point() +
# 5. Statistics
stat_smooth(method = "lm", se = FALSE)
Our visualization shows some non-parallel slopes, which suggests there could be an interaction between our covariate and independent variable.
Time to build our models for comparison against each other. We will use aov() to produce both an additive and interacting model involving rs174548 and age. We'll begin with the interacting model to see if it follows with our observations thus far.
# Check the homogeneity of the regression slopes
rs174548IntAgeFit <- aov(chol ~ as.factor(rs174548) ... age, data=cholesterol)
summary(rs174548IntAgeFit)
Although the slopes aren't exactly parallel between all the rs174548 genotypes, there is no significant interaction detected. Take a look at the ANCOVA as an additive model.
#model cholesterol by rs174548 and age without interaction
rs174548AgeFit <- lm(chol ~ as.factor(rs174548) ... age, data=cholesterol)
summary(rs174548AgeFit)
Look at the summary statistics for each model fit. How would you interpret the results? Notice that level 2 of rs174548 (i.e. the blue line in our visualization) has no significant p-value versus the baseline level of 0.
Let us compare the two models with an analysis of variance table to make sure that interaction doesn't really add any information to our model. Then we can compare against some simpler models using just age and rs174548.
#compare models
anova(rs174548AgeFit, rs174548IntAgeFit)
AIC(ageFit, rs174548Fit, rs174548AgeFit, rs174548IntAgeFit)
Using AIC, we see that modeling cholesterol variation using the additive model of age and rs174548 explains more variation than using age or rs174548 alone.
Let's dig deeper into our data and see what effect the genetic locus APOE has on cholesterol values. Some questions we can ask and tasks we can perform:
# Boxplot of cholesterol when grouping by APOE genotypes
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = as.factor(...), y = chol) +
# 4. Geoms
geom_boxplot() +
geom_jitter()
Looking at the visualization, we should note a couple of things:
APOE alone

Recall there are two ways we can build our ANOVA to look at the effect of each genotype. We can either compare it to a reference genotype (level 1, e2/e2) or we can produce the intercept for each group level by including a -1 in our model formula.
# Build a one-way ANOVA model using APOE as a predictor for cholesterol
ApoeFit <- lm(chol ~ ..., data = cholesterol)
summary(ApoeFit)
# Which of our APOE genotypes influences cholesterol significantly?
# Build a model
ApoeForceFit <- lm(chol ~ as.factor(APOE) ..., data = cholesterol)
summary(ApoeForceFit)
From our first glimpse we see that all of the levels are different from the base reference genotype of e2/e2. Do you recall how we can compare all of the groups against each other?
Recall we can build a contrast matrix - a simple way of describing which groups should be directly compared to each other. We will again use the contrMat() function in the format of a Tukey HSD. From there we can just pass this on to the glht() function for comparison.
# Generate a contrast matrix for post-hoc Tukey testing
Mt <- contrMat(table(cholesterol$APOE), type="Tukey")
Mt
# General linear hypothesis testing ensues
APOEmtPosthoc <- glht(ApoeForceFit, linfct = Mt)
# Look at all of the (15) group comparisons to see which are significant
summary(APOEmtPosthoc, test=adjusted("none")) # No adjustment for multiple comparisons
summary(APOEmtPosthoc, test=adjusted("bonferroni")) # Strict correction
summary(APOEmtPosthoc, test=adjusted("BH")) # FDR correction
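To build intuition for how these adjustment choices differ, here is a toy sketch using base R's p.adjust(), which applies the same "bonferroni" and "BH" corrections used by adjusted() above. The five p-values are hypothetical, not drawn from our cholesterol models.

```r
# Hypothetical raw p-values from five tests (illustration only)
p <- c(0.001, 0.01, 0.02, 0.03, 0.04)

# Bonferroni multiplies each p-value by the number of tests (capped at 1),
# so it is the strictest correction
p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate instead,
# so the adjusted values are much less severe
p.adjust(p, method = "BH")
```

Notice that the BH-adjusted values sit between the raw and the Bonferroni-adjusted values, which is why FDR correction typically keeps more comparisons significant.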
TukeyHSD() on your unbalanced design

While we previously suggested that Tukey's HSD works best on a balanced design, there is actually a secondary implementation known as the Tukey-Kramer test. The TukeyHSD() function implements this under the hood, so we could skip the contrast matrix step altogether, but we don't get to dictate how our multiple comparisons are corrected.
# Step 1: build an aov object for our model
APOE_aov <- ...(chol ~ as.factor(APOE), data = cholesterol)
# Check out the ANOVA table
Anova(APOE_aov, type = "III")
# Perform multiple comparisons
...(APOE_aov)
The APOE genotype has a significant effect on cholesterol with the most significant differences being between genotype 1 of the homozygous $\Large\varepsilon$2 alleles and genotype combinations of the $\Large\varepsilon$3 and $\Large\varepsilon$4 alleles. APOE genotype 5 is also significantly different from genotypes 2 and 4. However, the significant difference between genotype 5 and 2 is lost after multiple test correction.
Does this effect change with age or does it hold when age is held constant? Let's investigate further.
We'll want to explore the relationship between cholesterol, age, and APOE genotype.

APOE and age

# Plot age versus cholesterol grouping by APOE
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(x = age, y = chol, color = factor(APOE)) +
# 4. Geoms
geom_point() +
# 5. Statistics
stat_smooth(method = "lm", se = FALSE) # Drop the confidence intervals to improve visual analysis
It looks as though we are in a similar situation to our last analysis: based on the regression lines we've added, there may be some interaction between age and APOE. We'll investigate further by building the interacting model first with aov().
# Is there a significant interaction between age and APOE genotype?
APOEIntAgeFit <- aov(chol ~ ..., data=cholesterol)
# Bring up the ANOVA table
Anova(APOEIntAgeFit, type = "III")
lm()

From the analysis, there appears to be no significant interaction between APOE and the covariate of age, so we can proceed to build an ANCOVA without interaction. We'll build our additive model using the lm() function so we can quickly see how each subgroup compares to the reference genotype.
# Make an ANCOVA model with age and APOE as predictors in an additive model
APOEAgeFit <- ...(chol ~ age + as.factor(APOE), data=cholesterol)
summary(APOEAgeFit)
Anova(APOEAgeFit)
Now it's time to return to our various models for comparison again. How different are the additive versus interacting ANCOVA models? How does that compare to looking at the ANOVA model with APOE alone or age alone? We can finish up with a call to AIC() for confirmation.
# Compare our ANCOVA with and without interaction
anova(..., APOEIntAgeFit)
# Compare our best ANCOVA against the one-way model using only APOE as a predictor
anova(APOE_aov, APOEAgeFit)
# Compare our best ANCOVA against a model using only age as a predictor
anova(ageFit, APOEAgeFit)
# Check with AIC against our models
...(ageFit, APOE_aov, APOEAgeFit, APOEIntAgeFit)
The APOE genotype has a significant effect on cholesterol. This relationship remains while holding age constant. There is no significant interaction between APOE genotype and age. Our additive model performs better than one that includes an interaction between our independent variable and the covariate. Therefore the 'best' model incorporates both APOE genotype and age as an additive model.
That's the end for our sixth and penultimate class on R! You've made it through linear models and we've learned about the following:
Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete all chapters from the Correlation and Regression in R course (4,200 points total). This is a pass-fail assignment, and in order to pass you need to achieve at least 3,150 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course and you'll want to expand each section. It should look something like this:
You may need to take several screenshots if you cannot print it all in a single try. Submit the file(s) or a combined PDF for the homework to the assignment section of Quercus. By submitting your scores for each section, and chapter, we can keep track of your progress, identify knowledge gaps, and produce a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, October 21st to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.
https://github.com/ttimbers/lm_and_glm/blob/master/lm_and_glm.Rmd
https://github.com/seananderson/glmm-course
http://michael.hahsler.net/SMU/EMIS7331/R/regression.html
https://ms.mcmaster.ca/~bolker/emdbook/book.pdf
http://www.differencebetween.net/science/mathematics-statistics/difference-between-ancova-and-regression/
https://stats.stackexchange.com/questions/77563/linear-model-fit-seems-off-in-r
https://www.biostat.washington.edu/suminst/archives/SISG2016/SM1604
https://ms.mcmaster.ca/~bolker/
http://www.mathnstuff.com/math/spoken/here/2class/90/htest.htm
When predicting values you are assuming that your model is true. This might be fair within the range of your data, but predictions outside that range should be interpreted with caution. For example, polynomial data may look linear over a certain range.
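The extrapolation danger can be sketched with simulated data (not the cholesterol dataset): quadratic data can look quite linear over a narrow range, so a straight-line fit predicts reasonably inside that range but badly outside it.

```r
# Simulated example: y is truly quadratic in x
set.seed(1)
x <- seq(1, 10, by = 0.5)
y <- 0.5 * x^2 + rnorm(length(x), sd = 1)

# Over this narrow range a straight line fits quite well...
linFit <- lm(y ~ x)
summary(linFit)$r.squared

# ...but extrapolating to x = 30 badly underestimates the true value
# of 0.5 * 30^2 = 450
predict(linFit, newdata = data.frame(x = 30))
```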
The predict() function works with many different kinds of fits: not just linear models but nonlinear, polynomial, generalized linear models, etc. predict() dispatches the right method based on the class of the object it is given, but you can also call predict.lm() directly. The help page for predict.lm is more useful as it is specific to the linear model fit.
Use predict.lm() to predict the mean cholesterol at age 47 from our model object ageFit. Recall that ageFit is our first model:
lm(chol ~ age, data = cholesterol).
predict.lm(ageFit, newdata = data.frame(age=47))
In addition to the linear model, the function needs newdata, the values at which we want to predict. Note that newdata takes a data frame. We can predict the mean cholesterol at age 47 within a confidence interval whose coverage can be specified using level. The output is the mean, as well as the lower and upper boundaries of the estimate.
predict.lm(ageFit, newdata = data.frame(age=47), interval = "confidence")
| | fit | lwr | upr |
|---|---|---|---|
| 1 | 181.4874 | 179.0619 | 183.9129 |
We can also use a 'prediction' interval.
predict.lm(ageFit, newdata = data.frame(age=47), interval = "prediction")
| | fit | lwr | upr |
|---|---|---|---|
| 1 | 181.4874 | 138.7833 | 224.1915 |
Notice the difference in the upper and lower boundaries for these predictions. The first is the prediction for the mean serum cholesterol for individuals of age 47 and the second is for a single new individual of age 47. The second prediction has to account for random variability around the mean, rather than just the precision of the estimate of the mean. This may seem like a subtle difference, but as you can see it can change our boundaries quite a bit - we need to be clear on the question we are asking.
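A quick simulated sketch (again, not our cholesterol data) makes the distinction concrete: both intervals are centred on the same fitted mean, but the prediction interval must also cover the residual scatter of a single new observation, so it is much wider.

```r
# Simulated data with known noise
set.seed(2)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 * x + rnorm(100, sd = 8)
fit <- lm(y ~ x)

new <- data.frame(x = 50)
ci <- predict(fit, newdata = new, interval = "confidence")
pi <- predict(fit, newdata = new, interval = "prediction")

# Same point estimate for both intervals
ci[, "fit"] - pi[, "fit"]

# The prediction interval is several times wider than the confidence interval
ci[, "upr"] - ci[, "lwr"]
pi[, "upr"] - pi[, "lwr"]
```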
For our multiple linear regression model explaining cholesterol as a function of age and BMI (ageBMIFit), we could ask what cholesterol is predicted to be for a 60-year-old at a BMI of 21, a 60-year-old at a BMI of 26, and a 60-year-old at a BMI of 30. The standard error on the estimate of your means is obtained by setting se.fit = TRUE.
predict.lm(ageBMIFit, newdata = data.frame(BMI = c(21,26,30), age = 60), interval = "prediction", se.fit = TRUE)
| | fit | lwr | upr |
|---|---|---|---|
| 1 | 179.2557 | 137.1078 | 221.4036 |
| 2 | 186.3885 | 144.3676 | 228.4095 |
| 3 | 192.0947 | 149.9339 | 234.2556 |
Residuals are the differences between the observed response and the predicted response, and can be used to identify poorly fit data points, unequal variance (heteroscedasticity), nonlinear relationships, and examine the normality assumption.
We can plot the residuals vs $x$, residuals vs $y$, or a histogram of the residuals to see if there are any patterns. For example, plotting residuals against $x$ (age), should be unstructured and centered at 0.
If the residuals look like they are grouped in one section of the plot, or follow a pattern, then the model is not a good fit (ie. looks quadratic - you would have a nonlinear association). If it looks like a sideways tornado, then errors are increasing with $x$, and this is non-constant variance.
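Here is a small simulated sketch (not the lesson data) of the "structured residuals" problem: fitting a straight line to curved data leaves a clear U-shaped pattern, with the middle of the x range sitting below zero and the ends above it.

```r
# Simulated curved data fit with a straight line
set.seed(3)
x <- seq(-3, 3, length.out = 120)
y <- x^2 + rnorm(120, sd = 0.5)
badFit <- lm(y ~ x)
r <- residuals(badFit)

# Residuals are always observed minus fitted values
all.equal(as.numeric(r), as.numeric(y - fitted(badFit)))

# The middle of the x range sits below zero and the ends sit above it:
# a U shape telling us the linear model is misspecified
mean(r[abs(x) < 1])
mean(r[abs(x) > 2])
```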
The residuals are found in our lm object, ageFit, which also contains the inputs of our model; it is a list of 12. You'll see that $residuals is the second entry of our structured list.
str(ageFit)
List of 12
 $ coefficients : Named num [1:2] 166.9 0.31
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "age"
 $ residuals    : Named num [1:400] 25.13 21.27 18.24 4.55 -8.04 ...
  ..- attr(*, "names")= chr [1:400] "1" "2" "3" "4" ...
 $ effects      : Named num [1:400] -3678.3 89.45 17.61 1.86 -9.49 ...
  ..- attr(*, "names")= chr [1:400] "(Intercept)" "age" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:400] 190 183 187 177 183 ...
  ..- attr(*, "names")= chr [1:400] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:400, 1:2] -20 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:400] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "age"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.05 1.02
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 398
 $ xlevels      : Named list()
 $ call         : language lm(formula = chol ~ age, data = cholesterol)
 $ terms        :Classes 'terms', 'formula'  language chol ~ age
  .. ..- attr(*, "variables")= language list(chol, age)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "chol" "age"
  .. .. .. ..$ : chr "age"
  .. ..- attr(*, "term.labels")= chr "age"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. ..- attr(*, "predvars")= language list(chol, age)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "chol" "age"
 $ model        :'data.frame': 400 obs. of 2 variables:
  ..$ chol: int [1:400] 215 204 205 182 175 176 159 169 175 189 ...
  ..$ age : int [1:400] 74 51 64 34 52 39 79 38 52 58 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language chol ~ age
  .. .. ..- attr(*, "variables")= language list(chol, age)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "chol" "age"
  .. .. .. .. ..$ : chr "age"
  .. .. ..- attr(*, "term.labels")= chr "age"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. .. ..- attr(*, "predvars")= language list(chol, age)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "chol" "age"
 - attr(*, "class")= chr "lm"
We can use the broom package to get information out of linear model objects into the glorious data frame format that we know and love. This is done using the augment() function.
# Load up the broom library to help clean up our model information
library(broom)
ageFit.df <- augment(ageFit)
str(ageFit.df)
head(ageFit.df)
tibble [400 × 8] (S3: tbl_df/tbl/data.frame)
 $ chol      : int [1:400] 215 204 205 182 175 176 159 169 175 189 ...
 $ age       : int [1:400] 74 51 64 34 52 39 79 38 52 58 ...
 $ .fitted   : num [1:400] 190 183 187 177 183 ...
 $ .resid    : num [1:400] 25.13 21.27 18.24 4.55 -8.04 ...
 $ .hat      : num [1:400] 0.00693 0.00268 0.00351 0.00772 0.0026 ...
 $ .sigma    : num [1:400] 21.7 21.7 21.7 21.7 21.7 ...
 $ .cooksd   : num [1:400] 0.004717 0.001294 0.001251 0.000172 0.000179 ...
 $ .std.resid: num [1:400] 1.163 0.982 0.842 0.21 -0.371 ...
 - attr(*, "terms")=Classes 'terms', 'formula'  language chol ~ age
  .. ..- attr(*, "variables")= language list(chol, age)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "chol" "age"
  .. .. .. ..$ : chr "age"
  .. ..- attr(*, "term.labels")= chr "age"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  .. ..- attr(*, "predvars")= language list(chol, age)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "chol" "age"
| chol | age | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|
| <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 215 | 74 | 189.8664 | 25.133558 | 0.006926541 | 21.67724 | 4.716697e-03 | 1.1629645 |
| 204 | 51 | 182.7287 | 21.271254 | 0.002675863 | 21.68783 | 1.294058e-03 | 0.9821506 |
| 205 | 64 | 186.7631 | 18.236904 | 0.003513746 | 21.69480 | 1.251141e-03 | 0.8424005 |
| 182 | 34 | 177.4531 | 4.546943 | 0.007718507 | 21.71296 | 1.722974e-04 | 0.2104773 |
| 175 | 52 | 183.0391 | -8.039081 | 0.002595885 | 21.71041 | 1.792801e-04 | -0.3711709 |
| 176 | 39 | 179.0047 | -3.004730 | 0.005513219 | 21.71364 | 5.350503e-05 | -0.1389342 |
Now that we have a data frame we can plot our residuals (.resid) versus our fitted data (.fitted).
# Plot the residuals from our ageFit model
ggplot(ageFit.df) +
aes(x=.fitted, y=.resid) +
geom_point() +
geom_hline(yintercept=0, color="black")
You can plot the fitted and residual values with a categorical variable, but it is sometimes difficult to view patterns. For example, here is what plotting the residuals for our model of cholesterol as a function of genotype would look like.
# rs174548Fit <- lm(chol ~ as.factor(rs174548), data = cholesterol)
names(rs174548Fit)
names(ageFit)
length(rs174548Fit$residuals)
# Note that we provide the original dataset here in our call to augment.
# Otherwise it fails to provide residuals for some reason.
rs174548Fit.df <- augment(rs174548Fit, data=cholesterol)
head(rs174548Fit.df)
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE | age_group | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <dbl> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 74 | 215 | 26.2 | 367 | 1 | 2 | 4 | 1 | 187.8639 | 27.136054 | 0.006802721 | 21.91199 | 3.520671e-03 | 1.2417946 |
| 2 | 1 | 51 | 204 | 24.7 | 150 | 2 | 1 | 4 | 0 | 186.5000 | 17.500000 | 0.038461538 | 21.93634 | 8.832626e-03 | 0.8139084 |
| 3 | 0 | 64 | 205 | 24.2 | 213 | 0 | 1 | 4 | 1 | 181.0617 | 23.938326 | 0.004405286 | 21.92154 | 1.765700e-03 | 1.0941410 |
| 4 | 0 | 34 | 182 | 23.8 | 111 | 1 | 1 | 1 | 0 | 187.8639 | -5.863946 | 0.006802721 | 21.95267 | 1.644038e-04 | -0.2683447 |
| 5 | 1 | 52 | 175 | 34.1 | 328 | 0 | 0 | 1 | 0 | 181.0617 | -6.061674 | 0.004405286 | 21.95254 | 1.132178e-04 | -0.2770589 |
| 6 | 1 | 39 | 176 | 22.7 | 53 | 0 | 2 | 4 | 0 | 181.0617 | -5.061674 | 0.004405286 | 21.95319 | 7.894374e-05 | -0.2313522 |
# Plot the residuals from rs174548Fit.df
# 1. Data
ggplot(rs174548Fit.df) +
# 2. Aesthetics
aes(x=.fitted, y=.resid, colour = as.factor(rs174548)) +
# 4. Geoms
geom_point() +
geom_hline(yintercept=0, color="black")
Instead, you can perform a statistical test of equal variance.
Bartlett's test can test whether or not population (group) variances are the same. We can see if variances are equal in our model of cholesterol as a function of rs174548 genotype by inputting the formula and dataset into the bartlett.test() function.
bartlett.test(chol ~ factor(rs174548), data = cholesterol)
	Bartlett test of homogeneity of variances

data:  chol by factor(rs174548)
Bartlett's K-squared = 4.8291, df = 2, p-value = 0.08941
The p-value is telling us that the variance is not statistically different between our populations. Our assumption of equal variance is valid.
QQ-plots (quantile-quantile) are a tool to answer the question: does our data plausibly come from the (normal) distribution? The data are plotted against a theoretical distribution, and the points should fall on the straight line. Any points drifting away from the line are departing from that distribution.
The stat_qq() layer from ggplot2 allows us to plot our residuals along the y-axis in ascending order against the theoretical quantiles of a normal distribution on the x-axis. A straight reference line can be added with stat_qq_line() to see where the residuals fall.
# Generate a qq-plot of the residuals
# 1. Data
ggplot(rs174548Fit.df) +
# 2. Aesthetics
aes(sample = .resid) +
# 5. Statistics
stat_qq() +
stat_qq_line()
This looks pretty straight. We likely have normality of errors.
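As a numerical complement to the qq-plot (not covered in the lesson itself), base R's shapiro.test() performs the Shapiro-Wilk test of normality, where a small p-value is evidence against normality. A sketch on simulated residual-like values:

```r
set.seed(4)
normal_resid <- rnorm(200)      # draws from a normal distribution
skewed_resid <- rexp(200) - 1   # clearly right-skewed draws

# Typically a large p-value here: no evidence against normality
shapiro.test(normal_resid)$p.value

# A tiny p-value here: normality is firmly rejected
shapiro.test(skewed_resid)$p.value
```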
Let's try a less perfect example and look at the relationship between age and triglycerides (TG). Make a scatterplot of age and triglycerides with a linear fit to take a look at the data.
## plot age vs triglycerides
# 1. Data
ggplot(cholesterol) +
# 2. Aesthetics
aes(age, TG) +
# 4. Geoms
geom_point() +
# 5. Statistics
stat_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
Write a linear model for triglyceride levels as a function of age. Use broom to get the output of the lm object into data frame format.
# Generate a linear model examining the effect of age on TG
TGFit <- lm(TG ~ age, data = cholesterol)
# generate a data frame summary from the model
TGFit.df <- augment(TGFit)
Plot the residuals against the fitted values. Does the variance look equal across the residuals?
# Plot the residuals from TGFit.df
# 1. Data
ggplot(TGFit.df) +
# 2. Aesthetics
aes(.fitted, .resid) +
# 4. Geoms
geom_point() +
geom_hline(yintercept=0, color="black")
Our residuals are increasing with increasing values of $x$ (.fitted).
What do the residuals look like in a qq-plot?
# qq-plot of the TG model residuals
# 1. Data
ggplot(TGFit.df) +
# 2. Aesthetics
aes(sample = .resid) +
# 5. Statistics
stat_qq() +
stat_qq_line()
Our qq-plot points deviate from the line, suggesting the residuals are not normally distributed and the model is a poor fit.
For a) the ANOVA model of the effect of the genetic factor APOE on cholesterol levels and b) the ANCOVA model of age + APOE genotype on cholesterol levels: can you use any tools to assess whether the assumptions of your models are met?
# Build our models first
# ApoeFit <- lm(chol ~ as.factor(APOE), data = cholesterol)
summary(ApoeFit)
# APOEAgeFit <- lm(chol ~ age + as.factor(APOE), data=cholesterol)
summary(APOEAgeFit)
Call:
lm(formula = chol ~ as.factor(APOE), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-66.88 -13.88 -0.46 15.12 60.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 167.784 2.935 57.162 < 2e-16 ***
as.factor(APOE)2 16.352 5.347 3.058 0.002378 **
as.factor(APOE)3 23.882 6.157 3.879 0.000123 ***
as.factor(APOE)4 16.098 3.224 4.993 8.94e-07 ***
as.factor(APOE)5 27.688 4.075 6.795 4.02e-11 ***
as.factor(APOE)6 23.516 7.250 3.244 0.001280 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.96 on 394 degrees of freedom
Multiple R-squared: 0.113, Adjusted R-squared: 0.1018
F-statistic: 10.04 on 5 and 394 DF, p-value: 4.616e-09
Call:
lm(formula = chol ~ age + as.factor(APOE), data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-60.066 -14.163 0.195 13.966 59.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 151.56305 4.72094 32.104 < 2e-16 ***
age 0.30984 0.07158 4.329 1.91e-05 ***
as.factor(APOE)2 15.56004 5.23358 2.973 0.00313 **
as.factor(APOE)3 23.80580 6.02298 3.952 9.17e-05 ***
as.factor(APOE)4 14.96826 3.16465 4.730 3.14e-06 ***
as.factor(APOE)5 27.49356 3.98641 6.897 2.13e-11 ***
as.factor(APOE)6 23.74898 7.09187 3.349 0.00089 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.51 on 393 degrees of freedom
Multiple R-squared: 0.1534, Adjusted R-squared: 0.1405
F-statistic: 11.87 on 6 and 393 DF, p-value: 2.993e-12
# Put our ANOVA model into tabular format
# Also provide our original dataset here
ApoeFit.df <- augment(ApoeFit, data=cholesterol)
head(ApoeFit.df)
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE | age_group | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <dbl> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 74 | 215 | 26.2 | 367 | 1 | 2 | 4 | 1 | 183.8826 | 31.117409 | 0.004048583 | 20.92953 | 1.499068e-03 | 1.4874894 |
| 2 | 1 | 51 | 204 | 24.7 | 150 | 2 | 1 | 4 | 0 | 183.8826 | 20.117409 | 0.004048583 | 20.96390 | 6.265542e-04 | 0.9616621 |
| 3 | 0 | 64 | 205 | 24.2 | 213 | 0 | 1 | 4 | 1 | 183.8826 | 21.117409 | 0.004048583 | 20.96138 | 6.903921e-04 | 1.0094646 |
| 4 | 0 | 34 | 182 | 23.8 | 111 | 1 | 1 | 1 | 0 | 167.7843 | 14.215686 | 0.019607843 | 20.97605 | 1.563701e-03 | 0.6849162 |
| 5 | 1 | 52 | 175 | 34.1 | 328 | 0 | 0 | 1 | 0 | 167.7843 | 7.215686 | 0.019607843 | 20.98532 | 4.028777e-04 | 0.3476540 |
| 6 | 1 | 39 | 176 | 22.7 | 53 | 0 | 2 | 4 | 0 | 183.8826 | -7.882591 | 0.004048583 | 20.98476 | 9.619501e-05 | -0.3768074 |
# Plot our anova model
# 1. Data
ggplot(ApoeFit.df) +
# 2. Aesthetics
aes(.fitted, .resid, colour=as.factor(APOE)) +
# 4. Geoms
geom_point() +
geom_hline(yintercept=0, color="black")
# Test for equal cholesterol variance between APOE genotypes
bartlett.test(chol ~ factor(APOE), data = cholesterol)
	Bartlett test of homogeneity of variances

data:  chol by factor(APOE)
Bartlett's K-squared = 13.76, df = 5, p-value = 0.01721
# Put our ANCOVA into tabular format
ApoeAgeFit.df <- augment(APOEAgeFit, data=cholesterol)
head(ApoeAgeFit.df)
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE | age_group | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <dbl> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 74 | 215 | 26.2 | 367 | 1 | 2 | 4 | 1 | 189.4598 | 25.540210 | 0.007996603 | 20.49065 | 1.800917e-03 | 1.2505458 |
| 2 | 1 | 51 | 204 | 24.7 | 150 | 2 | 1 | 4 | 0 | 182.3334 | 21.666631 | 0.004353214 | 20.50223 | 7.004024e-04 | 1.0589378 |
| 3 | 0 | 64 | 205 | 24.2 | 213 | 0 | 1 | 4 | 1 | 186.3613 | 18.638654 | 0.004828439 | 20.50984 | 5.754470e-04 | 0.9111656 |
| 4 | 0 | 34 | 182 | 23.8 | 111 | 1 | 1 | 1 | 0 | 162.0978 | 19.902242 | 0.023712206 | 20.50632 | 3.348002e-03 | 0.9823015 |
| 5 | 1 | 52 | 175 | 34.1 | 328 | 0 | 0 | 1 | 0 | 167.6750 | 7.325043 | 0.019609361 | 20.52814 | 3.719212e-04 | 0.3607799 |
| 6 | 1 | 39 | 176 | 22.7 | 53 | 0 | 2 | 4 | 0 | 178.6152 | -2.615237 | 0.007570120 | 20.53111 | 1.786037e-05 | -0.1280244 |
# Plot the ANCOVA model
# 1. Data
ggplot(ApoeAgeFit.df) +
# 2. Aesthetics
aes(.fitted, .resid, colour=as.factor(APOE)) +
# 4. Geoms
geom_point() +
geom_hline(yintercept=0, color="black")
# Plot the residuals of our ApoeAgeFit model
ggplot(ApoeAgeFit.df) +
aes(sample = .resid) +
stat_qq() +
stat_qq_line()
The Bartlett test tells us that the variances are significantly different between the APOE genotype groups, so the assumption of equal variance is violated. We can also see this by plotting the residuals for the model with APOE genotype + age, whose pattern looks like an arrowhead. The qq-plot, however, looks reasonable.
The consequences of violating the assumptions of a linear model depend on which assumption is violated. The worst offence is non-linearity in your parameters, in which case you are using the wrong model entirely.
Our last example had a case of non-constant variance (heteroscedasticity), meaning there is a mean-variance relationship (recall the tornado shape). In this case the parameter estimates are minimally impacted; however, the variance estimates are incorrect.
To account for this we can use:
Data transformation can solve some nonlinearity, unequal variance and non-normality problems when applied to the dependent variable, the independent variable, or both. However, interpreting the results of these transformations can be tricky.
logTGFit <- lm(log(TG) ~ age, data = cholesterol)
summary(logTGFit)
logTGFit.df <- augment(logTGFit, data=cholesterol)
head(logTGFit.df)
Call:
lm(formula = log(TG) ~ age, data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
-0.75656 -0.20390 -0.02207 0.17910 1.06931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7115803 0.0559237 66.37 <2e-16 ***
age 0.0248646 0.0009866 25.20 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2844 on 398 degrees of freedom
Multiple R-squared: 0.6148, Adjusted R-squared: 0.6138
F-statistic: 635.2 on 1 and 398 DF, p-value: < 2.2e-16
| ID | sex | age | chol | BMI | TG | rs174548 | rs4775401 | APOE | age_group | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <int> | <dbl> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 74 | 215 | 26.2 | 367 | 1 | 2 | 4 | 1 | 5.551564 | 0.35379781 | 0.006926541 | 0.2841709 | 5.435798e-03 | 1.2484729 |
| 2 | 1 | 51 | 204 | 24.7 | 150 | 2 | 1 | 4 | 0 | 4.979677 | 0.03095808 | 0.002675863 | 0.2847247 | 1.594185e-05 | 0.1090111 |
| 3 | 0 | 64 | 205 | 24.2 | 213 | 0 | 1 | 4 | 1 | 5.302918 | 0.05837458 | 0.003513746 | 0.2847138 | 7.455463e-05 | 0.2056377 |
| 4 | 0 | 34 | 182 | 23.8 | 111 | 1 | 1 | 1 | 0 | 4.556978 | 0.15255195 | 0.007718507 | 0.2846252 | 1.127972e-03 | 0.5385363 |
| 5 | 1 | 52 | 175 | 34.1 | 328 | 0 | 0 | 1 | 0 | 5.004542 | 0.78847175 | 0.002595885 | 0.2819584 | 1.003032e-02 | 2.7762926 |
| 6 | 1 | 39 | 176 | 22.7 | 53 | 0 | 2 | 4 | 0 | 4.681301 | -0.71100956 | 0.005513219 | 0.2824715 | 1.742438e-02 | -2.5072094 |
# plot our log-transformed TG data model
# 1. Data
ggplot(logTGFit.df) +
# 2. Aesthetics
aes(.fitted, .resid) +
# 4. Geoms
geom_point() +
geom_hline(yintercept=0, color="black")
We corrected the non-constant variance issue, but it is harder to interpret our model. Exploring how to address these issues is beyond the scope of this appendix but some additional resources include:
the gee package, a generalized estimating equation solver similar to lm(), except you also supply an additional 'id' grouping variable
glm(), a generalized linear model function that can use distributions other than the normal Gaussian
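As a taste of the glm() route, here is a hedged sketch on simulated data shaped loosely like the TG-versus-age problem (a right-skewed response whose spread grows with its mean; the coefficients 3.7 and 0.025 are made up for the simulation, not estimated from our dataset). A Gamma GLM with a log link models that mean-variance relationship directly, so no manual transformation of the response is needed.

```r
# Simulate a response whose standard deviation grows with its mean
set.seed(5)
age <- runif(300, min = 30, max = 80)
mu  <- exp(3.7 + 0.025 * age)                 # true mean rises with age
tg  <- rgamma(300, shape = 5, rate = 5 / mu)  # constant shape: sd proportional to mean

# Gamma GLM with log link: coefficients are on the log scale,
# much like our log-transformed lm() above
gammaFit <- glm(tg ~ age, family = Gamma(link = "log"))
summary(gammaFit)

# But predictions come back on the original scale directly
predict(gammaFit, newdata = data.frame(age = 50), type = "response")
```

The log-transform lm() and the Gamma GLM often give similar coefficient estimates, but the GLM avoids the back-transformation headache when you want predictions on the original scale.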